[DISCUSS] solve unstable build capacity problem on TravisCI

bowen.li
Hi devs,

I've been feeling the pain of unstable build capacity on Travis for Flink
PRs [1]. Specifically, I've often noticed that no build in the queue makes
any progress for hours, and then 5 or 6 builds suddenly kick off together
after the long pause. I'm in the PST (UTC-08) time zone, and I've seen the
pause last as long as 6 hours, from 9am to 3pm PST (let alone the time
needed to drain the queue afterwards).

I think this has greatly hurt our productivity. In my experience, PRs
submitted early in the morning PST don't finish their builds until late at
night the same day.

So my questions are:

- Has anyone else experienced the same problem or made similar observations
on TravisCI? (I suspect it has something to do with the time zone.)

- Which TravisCI pricing plan is Flink currently using? Is it the free
plan for open source projects? What build capacity does the current plan
guarantee?

- If the current pricing plan (either free or paid) can't provide stable
build capacity, can we upgrade to a higher-priced plan with larger and more
stable build capacity?

BTW, another factor that contributes to the productivity problem is that our
build is slow: we run a full build for every PR, and a successful full build
takes ~5h. We definitely have more options to address that, for instance
modularizing the build graph and reusing artifacts from previous builds.
But I think that would be a big effort that is hard to accomplish in a short
period of time, and it may deserve its own separate discussion.
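
As a back-of-the-envelope illustration of how a queue pause and a ~5h build
compound, here is a minimal sketch. It assumes a simple FIFO queue and a fixed
build duration; the function name `finish_times`, the slot counts, and the
submission times are hypothetical and are not taken from Travis or from
Flink's actual plan.

```python
import heapq


def finish_times(submit_hours, build_hours=5.0, slots=1, stalled_until=0.0):
    """Return the finish time (hour of day) of each build in a FIFO queue.

    submit_hours:  submission times in hours, sorted ascending.
    build_hours:   duration of one full build (~5h, per the thread).
    slots:         number of builds that may run concurrently (assumed).
    stalled_until: hour before which no build can start (models the pause).
    """
    # Min-heap holding the time at which each slot next becomes free.
    free_at = [stalled_until] * slots
    heapq.heapify(free_at)

    finished = []
    for submitted in submit_hours:
        slot_free = heapq.heappop(free_at)   # earliest available slot
        start = max(submitted, slot_free)    # wait for the slot if needed
        end = start + build_hours
        heapq.heappush(free_at, end)
        finished.append(end)
    return finished


if __name__ == "__main__":
    prs = [9, 10, 11]  # three PRs submitted at 9am, 10am, 11am

    # One slot, nothing starts before 3pm: the last PR finishes at hour 30,
    # i.e. 6am the next day.
    print(finish_times(prs, slots=1, stalled_until=15))  # [20.0, 25.0, 30.0]

    # Five slots, same pause: everything finishes at 8pm.
    print(finish_times(prs, slots=5, stalled_until=15))  # [20.0, 20.0, 20.0]

    # Five slots, no pause: each PR finishes 5h after submission.
    print(finish_times(prs, slots=5))                    # [14.0, 15.0, 16.0]
```

Under these assumptions the pause dominates the turnaround when several slots
are available, while with a single slot the 5h build time quickly pushes later
PRs into the next day.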

[1] https://travis-ci.org/apache/flink/pull_requests

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

bowen.li
https://travis-ci.org/apache/flink/builds/549681530  This build request has
been sitting at the **HEAD of the queue** since I first saw it at 10:30am PST
(not sure how long it had been there before that). It's 4:12pm PST now and
it still hasn't started.

On Mon, Jun 24, 2019 at 2:48 PM Bowen Li <[hidden email]> wrote:

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

Steven Wu
Long and sometimes unstable builds are definitely a pain point.

I suspect the build failure here in flink-connector-kafka is not related to
my change, but there is no easy way to re-run the build in the Travis UI. A
Google search turned up a trick: closing and reopening the PR triggers a
rebuild. But that could add noise to the PR activity.
https://travis-ci.org/apache/flink/jobs/545555519

Travis CI builds for my personal repo often fail after 4+ hours for exceeding
the time limit:
"The job exceeded the maximum time limit for jobs, and has been terminated."

On Mon, Jun 24, 2019 at 4:15 PM Bowen Li <[hidden email]> wrote:

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

bowen.li
Hi Steven,

I think you may not have read what I wrote. The discussion is about "unstable
build **capacity**", in other words "unstable / lack of build resources",
not "unstable builds".

On Mon, Jun 24, 2019 at 4:40 PM Steven Wu <[hidden email]> wrote:

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

Jark Wu-2
Hi Bowen,

Thanks for bringing this up. We have also suffered from the long build times.
I agree that we should focus on solving the build capacity problem in this
thread.

My observation is that only one build is running at a time, while all the
others (other PRs, master) are pending.
The Travis pricing plans [1] show that it can support concurrent build jobs,
but I don't know which plan we are using. It might be the free plan for open
source.

I've cc'ed Chesnay, who may have some experience with Travis.

Regards,
Jark

[1]: https://travis-ci.com/plans

On Tue, 25 Jun 2019 at 08:11, Bowen Li <[hidden email]> wrote:

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

Kurt Young
Hi Bowen,

Thanks for bringing this up. We have actually discussed this, and I think
Till and George have already spent some time investigating it. I have cc'ed
both of them, and maybe they can share their findings.

Best,
Kurt


On Tue, Jun 25, 2019 at 10:08 AM Jark Wu <[hidden email]> wrote:

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

Kurt Young
(Forgot to cc George)

Best,
Kurt


On Tue, Jun 25, 2019 at 10:16 AM Kurt Young <[hidden email]> wrote:

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

Jeff Zhang
Hi Folks,

Zeppelin ran into this kind of issue before. We solved it by delegating each
contributor's PR build to their own Travis account (everyone gets 5 free
slots for Travis builds).
The build on the Apache Travis account is only triggered when a PR is merged.



On Tue, Jun 25, 2019 at 10:16 AM Kurt Young <[hidden email]> wrote:

--
Best Regards

Jeff Zhang

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

Jark Wu-2
Hi Jeff,

Thanks for sharing the Zeppelin approach. I think it's a good idea to
leverage users' Travis accounts.
That way, we can have almost unlimited concurrent build jobs, and
developers can restart builds themselves (currently only committers can
restart a PR's build).

But it's still not clear to me how to integrate a user's Travis build into
the Flink pull request's build automatically. Can you explain in more detail?

Another question: does Travis only build branches for a user account?
My concern is that builds for PRs apply the user's commits on top of the
current master branch, which helps us find problems before merging. Builds
for branches would miss the impact of new commits on master.
How does Zeppelin solve this problem?

Thanks again for sharing the idea.

Regards,
Jark

On Tue, 25 Jun 2019 at 11:01, Jeff Zhang <[hidden email]> wrote:

> > >> > > > observation
> > >> > > > > on TravisCI? (I suspect it has things to do with time zone)
> > >> > > > >
> > >> > > > > - What pricing plan of TravisCI is Flink currently using? Is
> it
> > >> the
> > >> > > free
> > >> > > > > plan for open source projects? What are the guaranteed build
> > >> capacity
> > >> > > of
> > >> > > > > the current plan?
> > >> > > > >
> > >> > > > > - If the current pricing plan (either free or paid) can't
> > provide
> > >> > > stable
> > >> > > > > build capacity, can we upgrade to a higher priced plan with
> > larger
> > >> > and
> > >> > > > more
> > >> > > > > stable build capacity?
> > >> > > > >
> > >> > > > > BTW, another factor that contribute to the productivity
> problem
> > is
> > >> > that
> > >> > > > > our build is slow - we run full build for every PR and a
> > >> successful
> > >> > > full
> > >> > > > > build takes ~5h. We definitely have more options to solve it,
> > for
> > >> > > > instance,
> > >> > > > > modularize the build graphs and reuse artifacts from the
> > previous
> > >> > > build.
> > >> > > > > But I think that can be a big effort which is much harder to
> > >> > accomplish
> > >> > > > in
> > >> > > > > a short period of time and may deserve its own separate
> > >> discussion.
> > >> > > > >
> > >> > > > > [1] https://travis-ci.org/apache/flink/pull_requests
> > >> > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> >
>
>
> --
> Best Regards
>
> Jeff Zhang
>

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

Chesnay Schepler-3
In reply to this post by bowen.li

On 24/06/2019 23:48, Bowen Li wrote:
> - Has anyone else experienced the same problem or have similar observation
> on TravisCI? (I suspect it has things to do with time zone)
In Europe we have the same problem.
>
> - What pricing plan of TravisCI is Flink currently using? Is it the free
> plan for open source projects? What are the guaranteed build capacity of
> the current plan?
Flink is using the Travis resources provided by the ASF, which afaik the
ASF is paying for.

Note that in the past the Flink project was already approached by INFRA
since we were using too many Travis resources,
so this is _not_ as simple as asking for more.
>
> - If the current pricing plan (either free or paid) can't provide stable
> build capacity, can we upgrade to a higher priced plan with larger and more
> stable build capacity?
We are currently investigating whether companies could donate/sponsor
Travis CI resources to the ASF for increasing the build capacity;
currently waiting for an answer from INFRA.
>
> BTW, another factor that contribute to the productivity problem is that our
> build is slow - we run full build for every PR and a successful full build
> takes ~5h. We definitely have more options to solve it, for instance,
> modularize the build graphs and reuse artifacts from the previous build.
> But I think that can be a big effort which is much harder to accomplish in
> a short period of time and may deserve its own separate discussion.
We are already doing that. The vast majority of the build time is
simply due to tests taking way too long, not compilation.
The tests for the Kafka connector alone exceed a single profile, as do
those for the Table API.
Unless people start caring about test times before adding them, this
issue cannot be solved.

Of course, this discussion isn't new; I already raised it the last 2
times we approached the Travis limits, with little to no effect to be seen.

At this point I'm sure someone out there is thinking "well, just don't
run kafka tests for every PR. Like, check the diff or something",
and yes, sure, that _might_ work. But to this day, despite numerous
people suggesting it, I still haven't seen a single person actually try
implementing it.

The problem with these kinds of approaches is that they tend to be
brittle as hell, result in subtle behaviors if they have bugs, and
overall make the CI significantly more complicated by introducing
various edge cases.
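
To make the "check the diff" idea concrete, here is a minimal sketch in Python. It is not something Flink's CI does today. The profile names and the directory-to-profile mapping are made up for illustration, and the mapping deliberately ignores dependencies between modules, which is exactly where this kind of scheme gets brittle.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from a top-level directory to the test profiles that
# must run when anything beneath it changes. Illustrative only.
MODULE_PROFILES = {
    "flink-connectors": {"connectors"},
    "flink-table": {"table"},
    "docs": set(),  # documentation-only changes need no test profile
}

# Hypothetical full set of profiles, used as the safe fallback.
ALL_PROFILES = {"core", "connectors", "table", "misc"}


def profiles_for_diff(changed_paths):
    """Return the set of profiles to run for a list of changed file paths."""
    selected = set()
    for path in changed_paths:
        parts = PurePosixPath(path).parts
        top = parts[0] if parts else ""
        if top not in MODULE_PROFILES:
            # Unmapped path (root pom.xml, build scripts, core modules):
            # run everything rather than risk skipping a needed test.
            return set(ALL_PROFILES)
        selected |= MODULE_PROFILES[top]
    return selected
```

Even in this toy version the edge cases show up: any unmapped path has to fall back to the full build, and a change under one mapped directory can still break a module that depends on it.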

Our current CI is, relatively speaking, straightforward and consistent.
And as it stands we can't afford elaborate schemes because I just don't
have the time capacity for maintaining that.

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

bowen.li
Want to summarize Chesnay's points for everyone reading this thread: 1) the
build resources Flink is currently using belong to ASF INFRA, and 2) we are
waiting on ASF INFRA's response on whether we can donate/sponsor extra
build resources for Flink.

I think it'll be super helpful to pay for and secure dedicated build resources
for Flink. If that doesn't work, I agree with Jark that the Zeppelin
approach Jeff described sounds promising.

Jeff, can you answer Jark's questions above and share what the Zeppelin
community's practices look like?

Cheers,
Bowen


Re: [DISCUSS] solve unstable build capacity problem on TravisCI

Chesnay Schepler-3
In reply to this post by Jark Wu-2
From what I gathered, there's no special sauce that the Zeppelin
project uses which actually integrates a user's Travis account into the PR.

They just disabled Travis for PRs. And that's kind of it.

Naturally we can do this (duh) and save the ASF a fair amount of
resources, but there are downsides:

The discoverability of the Travis check takes a nose-dive. Either we
require every contributor to always, on every commit, also post a Travis
build, or we have the reviewer sift through the contributor's account to
find it.

This is rather cumbersome. Additionally, it's also not equivalent to
having a PR build.

A normal branch build takes a branch as is and tests it. A PR build
merges the branch into master, and then runs it. (Fun fact: This is why
a PR with merge conflicts is not being run on Travis.)
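
The difference between the two build types can be sketched as command lists in Python. This is a conceptual equivalent only: as far as I know Travis builds the merge ref that GitHub provides rather than merging locally, and the function names and the `mvn verify` step are stand-ins for whatever the build actually runs.

```python
def branch_build_commands(branch):
    """Commands for a branch build: test the branch exactly as pushed."""
    return [
        ["git", "checkout", branch],
        ["mvn", "verify"],
    ]


def pr_build_commands(branch, target="master"):
    """Commands for a PR build: merge the branch into the target, then test."""
    return [
        ["git", "checkout", target],
        # --no-commit keeps the merge result in the working tree only; if the
        # merge conflicts, git exits non-zero and there is nothing to test.
        ["git", "merge", "--no-ff", "--no-commit", branch],
        ["mvn", "verify"],
    ]
```

A branch build in a contributor's own account only ever runs the first variant, so it cannot see breakage caused by commits that landed on master after the branch was cut.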

And ultimately, everyone can already make use of this approach anyway.

On 25/06/2019 08:02, Jark Wu wrote:

> Hi Jeff,
>
> Thanks for sharing the Zeppelin approach. I think it's a good idea to
> leverage user's travis account.
> In this way, we can have almost unlimited concurrent build jobs and
> developers can restart build by themselves (currently only committers
> can restart PR's build).
>
> But I'm still not very clear how to integrate user's travis build into
> the Flink pull request's build automatically. Can you explain more in
> detail?
>
> Another question: does travis only build branches for user account?
> My concern is that builds for PRs will rebase user's commits against
> current master branch.
> This will help us to find problems before merge.  Builds for branches
> will lose the impact of new commits in master.
> How does Zeppelin solve this problem?
>
> Thanks again for sharing the idea.
>
> Regards,
> Jark
>


Re: [DISCUSS] solve unstable build capacity problem on TravisCI

Robert Metzger
Do we know if using "the best" available hardware would improve the build
times?
Imagine we ran the build on machines with enough main memory to mount
everything on a ramdisk, plus the latest CPU architecture.

Throwing hardware at the problem could help reduce the time of an
individual build, and using our own infrastructure would remove our
dependency on Apache's Travis account (with the obvious downside of having
to maintain the infrastructure).
We could use an open source Travis alternative to get a similar
experience and make the migration easy.



Re: [DISCUSS] solve unstable build capacity problem on TravisCI

dwysakowicz
Sorry to jump in late, but I think Bowen missed the most important point
from Chesnay's previous message in the summary. The ultimate reason for
all the problems is that the tests take close to 2 hours to run already.
I fully support this claim: "Unless people start caring about test times
before adding them, this issue cannot be solved"

This is also another reason why using users' Travis accounts won't help.
Every few weeks we reach the user-account time limit for a single profile.
This makes users' builds simply fail until we either properly
decrease the time the tests take (which I am not sure we ever did) or
postpone the problem by splitting into more profiles. (Note that the ASF
Travis account has higher time limits.)
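
As a rough illustration of what "splitting into more profiles" amounts to, here is a first-fit-decreasing sketch in Python. The function name, the module names and the durations in the example below are all invented; this is not how the Flink profiles are actually assembled.

```python
def split_into_profiles(durations, limit_minutes):
    """Greedily pack test modules into profiles that each fit the time limit.

    durations: dict of module name -> test duration in minutes.
    Returns a list of profiles, each a list of module names.
    """
    profiles = []  # each entry: [total_minutes, [module, ...]]
    # Longest modules first (first-fit decreasing) tends to need fewer profiles.
    for module, minutes in sorted(durations.items(), key=lambda kv: (-kv[1], kv[0])):
        if minutes > limit_minutes:
            # No amount of splitting by module helps here; the tests
            # themselves have to get faster.
            raise ValueError(f"{module} alone exceeds the {limit_minutes} min limit")
        for profile in profiles:
            if profile[0] + minutes <= limit_minutes:
                profile[0] += minutes
                profile[1].append(module)
                break
        else:
            profiles.append([minutes, [module]])
    return [modules for _, modules in profiles]
```

For example, `split_into_profiles({"kafka": 45, "table": 40, "core": 20, "misc": 10}, 50)` packs four modules into three profiles. Every added test pushes the totals up again, so the split only buys time, and once a single module exceeds the limit on its own the function can only raise.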

Best,

Dawid

On 26/06/2019 09:36, Robert Metzger wrote:

> Do we know if using "the best" available hardware would improve the build
> times?
> Imagine we would run the build on machines with plenty of main memory to
> mount everything to ramdisk + the latest CPU architecture?
>
> Throwing hardware at the problem could help reduce the time of an
> individual build, and using our own infrastructure would remove our
> dependency on Apache's Travis account (with the obvious downside of having
> to maintain the infrastructure)
> We could use an open source travis alternative, to have a similar
> experience and make the migration easy.
>
>
> On Wed, Jun 26, 2019 at 9:34 AM Chesnay Schepler <[hidden email]> wrote:
>
>>  From what I gathered, there's no special sauce that the Zeppelin
>> project uses which actually integrates a users Travis account into the PR.
>>
>> They just disabled Travis for PRs. And that's kind of it.
>>
>> Naturally we can do this (duh) and safe the ASF a fair amount of
>> resources, but there are downsides:
>>
>> The discoverability of the Travis check takes a nose-dive. Either we
>> require every contributor to always, an every commit, also post a Travis
>> build, or we have the reviewer sift through the contributors account to
>> find it.
>>
>> This is rather cumbersome. Additionally, it's also not equivalent to
>> having a PR build.
>>
>> A normal branch build takes a branch as is and tests it. A PR build
>> merges the branch into master, and then runs it. (Fun fact: This is why
>> a PR without merge conflicts is not being run on Travis.)
>>
>> And ultimately, everyone can already make use of this approach anyway.
>>
>> On 25/06/2019 08:02, Jark Wu wrote:
>>> Hi Jeff,
>>>
>>> Thanks for sharing the Zeppelin approach. I think it's a good idea to
>>> leverage user's travis account.
>>> In this way, we can have almost unlimited concurrent build jobs and
>>> developers can restart build by themselves (currently only committers
>>> can restart PR's build).
>>>
>>> But I'm still not very clear how to integrate user's travis build into
>>> the Flink pull request's build automatically. Can you explain more in
>>> detail?
>>>
>>> Another question: does travis only build branches for user account?
>>> My concern is that builds for PRs will rebase user's commits against
>>> current master branch.
>>> This will help us to find problems before merge.  Builds for branches
>>> will lose the impact of new commits in master.
>>> How does Zeppelin solve this problem?
>>>
>>> Thanks again for sharing the idea.
>>>
>>> Regards,
>>> Jark
>>>


Re: [DISCUSS] solve unstable build capacity problem on TravisCI

bowen.li
Hi,

@Dawid, I think the "long test running" problem that I mentioned in the
first email, and that you also raised, belongs to "a big effort which is
much harder to accomplish in a short period of time and may deserve its own
separate discussion". That is why I didn't include it in what we can do in
the foreseeable short term.

Besides, I don't think that's the ultimate reason for the lack of build
resources. Even if the build were shortened to something like 2h, the
problem I described, where no build machine does any work for 6 or more
hours during PST daytime, would still happen, because no machine from ASF
INFRA's pool is allocated to Flink. I have paid close attention to the
build queue over the past few weekdays, and the pattern is pretty clear by
now.
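
As a rough back-of-the-envelope illustration of the point above, here is a
small Python sketch. The 9am submission, the 6h idle window, and the 5h and
2h build durations are the figures mentioned in this thread; the function
name and the calendar date are only illustrative assumptions.

```python
from datetime import datetime, timedelta


def finish_time(submitted, idle_wait_hours, build_hours):
    """Return when a build finishes: submission + idle queue wait + build time."""
    return submitted + timedelta(hours=idle_wait_hours + build_hours)


# Illustrative inputs: a 9am submission and a 6h window with no build machine.
submitted = datetime(2019, 6, 26, 9, 0)  # the date itself is arbitrary
idle_wait = 6

for build_hours in (5, 2):
    done = finish_time(submitted, idle_wait, build_hours)
    print(f"{build_hours}h build finishes at {done:%H:%M}")
# prints:
# 5h build finishes at 20:00
# 2h build finishes at 17:00
```

Under these assumptions, cutting the build from 5h to 2h moves the finish
time by 3h, while the 6h in which nothing runs stays exactly the same.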

**The ultimate root cause** is that we don't have any **dedicated** build
resources that we can reliably count on. I'm actually ok with waiting a
long time if build requests are running, because it means we are at least
making progress. But I'm not ok with having no build resources at all. A
better state to aim for in the short term is to always have at least a
central pool (say 3 or 5) of machines dedicated to building Flink at any
time, or maybe to use users' resources.

@Chesnay @Robert I synced with Jeff offline: the Zeppelin community is
using a Jenkins job to automatically build on users' Travis accounts and
link the results back to the GitHub PR. I guess the Jenkins job fetches the
latest upstream master and builds the PR against it. Jeff has filed tickets
to learn about and get access to the Jenkins infra. It would be better to
fully understand it first before judging this approach.
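
A minimal sketch of what "fetch latest upstream master and build the PR
against it" could mean in git terms is given below. This is only a guess
for illustration, not Zeppelin's actual Jenkins job, which is not described
in this thread. The function name, the example branch name, and the choice
to return the commands instead of running them are all assumptions.

```python
def pr_build_commands(upstream_url, pr_branch, base="master"):
    """Return git commands that merge a PR branch into the latest upstream base.

    Hypothetical helper: it only builds the command list and runs nothing.
    """
    return [
        # Fetch the current tip of the upstream base branch into FETCH_HEAD.
        ["git", "fetch", upstream_url, base],
        # Check out that tip without touching any local branch.
        ["git", "checkout", "--detach", "FETCH_HEAD"],
        # Merge the contributor's branch; a conflict here stops the build early.
        ["git", "merge", "--no-ff", "--no-edit", pr_branch],
    ]


# Example values are placeholders, not real job configuration.
for cmd in pr_build_commands("https://github.com/apache/flink.git", "my-feature"):
    print(" ".join(cmd))
```

The merged result, rather than the branch as is, would then be handed to the
normal build. That is the difference between a PR build and a plain branch
build that was raised earlier in the thread.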

I've also heard good things about CircleCI, and ASF INFRA seems to have a
pool of build capacity there too. It could be an alternative to consider.

On Wed, Jun 26, 2019 at 12:44 AM Dawid Wysakowicz <[hidden email]>
wrote:

> Sorry to jump in late, but I think Bowen missed the most important point
> from Chesnay's previous message in the summary. The ultimate reason for
> all the problems is that the tests take close to 2 hours to run already.
> I fully support this claim: "Unless people start caring about test times
> before adding them, this issue cannot be solved"
>
> This is also another reason why using user's Travis account won't help.
> Every few weeks we reach the user's time limit for a single profile.
> This makes the user's builds simply fail, until we either properly
> decrease the time the tests take (which I am not sure we ever did) or
> postpone the problem by splitting into more profiles. (Note that the ASF
> Travis account has higher time limits)
>
> Best,
>
> Dawid

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

bowen.li
Just to elaborate a bit more on why a slow build is ok but having no resources is not: say I submit a build request at 9am PST, no other requests exist, and mine is at the head of the queue. Currently, it still cannot get built until 4 or 5pm.

>> On Wed, Jun 26, 2019 at 12:44 AM Dawid Wysakowicz <[hidden email]> wrote:
>> Sorry to jump in late, but I think Bowen missed the most important point
>> from Chesnay's previous message in the summary. The ultimate reason for
>> all the problems is that the tests take close to 2 hours to run already.
>> I fully support this claim: "Unless people start caring about test times
>> before adding them, this issue cannot be solved"
>>
>> This is also another reason why using user's Travis account won't help.
>> Every few weeks we reach the user's time limit for a single profile.
>> This makes the user's builds simply fail, until we either properly
>> decrease the time the tests take (which I am not sure we ever did) or
>> postpone the problem by splitting into more profiles. (Note that the ASF
>> Travis account has higher time limits)
>>
>> Best,
>>
>> Dawid
>>
>> On 26/06/2019 09:36, Robert Metzger wrote:
>> > Do we know if using "the best" available hardware would improve the build
>> > times?
>> > Imagine we would run the build on machines with plenty of main memory to
>> > mount everything to ramdisk + the latest CPU architecture?
>> >
>> > Throwing hardware at the problem could help reduce the time of an
>> > individual build, and using our own infrastructure would remove our
>> > dependency on Apache's Travis account (with the obvious downside of having
>> > to maintain the infrastructure)
>> > We could use an open source travis alternative, to have a similar
>> > experience and make the migration easy.
>> >
>> >
>> > On Wed, Jun 26, 2019 at 9:34 AM Chesnay Schepler <[hidden email]> wrote:
>> >
>> >>  From what I gathered, there's no special sauce that the Zeppelin
>> >> project uses which actually integrates a users Travis account into the PR.
>> >>
>> >> They just disabled Travis for PRs. And that's kind of it.
>> >>
>> >> Naturally we can do this (duh) and safe the ASF a fair amount of
>> >> resources, but there are downsides:
>> >>
>> >> The discoverability of the Travis check takes a nose-dive. Either we
>> >> require every contributor to always, an every commit, also post a Travis
>> >> build, or we have the reviewer sift through the contributors account to
>> >> find it.
>> >>
>> >> This is rather cumbersome. Additionally, it's also not equivalent to
>> >> having a PR build.
>> >>
>> >> A normal branch build takes a branch as is and tests it. A PR build
>> >> merges the branch into master, and then runs it. (Fun fact: This is why
>> >> a PR without merge conflicts is not being run on Travis.)
>> >>
>> >> And ultimately, everyone can already make use of this approach anyway.
>> >>
>> >> On 25/06/2019 08:02, Jark Wu wrote:
>> >>> Hi Jeff,
>> >>>
>> >>> Thanks for sharing the Zeppelin approach. I think it's a good idea to
>> >>> leverage user's travis account.
>> >>> In this way, we can have almost unlimited concurrent build jobs and
>> >>> developers can restart build by themselves (currently only committers
>> >>> can restart PR's build).
>> >>>
>> >>> But I'm still not very clear how to integrate user's travis build into
>> >>> the Flink pull request's build automatically. Can you explain more in
>> >>> detail?
>> >>>
>> >>> Another question: does travis only build branches for user account?
>> >>> My concern is that builds for PRs will rebase user's commits against
>> >>> current master branch.
>> >>> This will help us to find problems before merge.  Builds for branches
>> >>> will lose the impact of new commits in master.
>> >>> How does Zeppelin solve this problem?
>> >>>
>> >>> Thanks again for sharing the idea.
>> >>>
>> >>> Regards,
>> >>> Jark
>> >>>
>> >>> On Tue, 25 Jun 2019 at 11:01, Jeff Zhang <[hidden email]
>> >>> <mailto:[hidden email]>> wrote:
>> >>>
>> >>>     Hi Folks,
>> >>>
>> >>>     Zeppelin meet this kind of issue before, we solve it by delegating
>> >>>     each
>> >>>     one's PR build to his travis account (Everyone can have 5 free
>> >>>     slot for
>> >>>     travis build).
>> >>>     Apache account travis build is only triggered when PR is merged.
>> >>>
>> >>>
>> >>>
>> >>>     Kurt Young <[hidden email] <mailto:[hidden email]>>
>> >>>     于2019年6月25日周二 上午10:16写道:
>> >>>
>> >>>     > (Forgot to cc George)
>> >>>     >
>> >>>     > Best,
>> >>>     > Kurt
>> >>>     >
>> >>>     >
>> >>>     > On Tue, Jun 25, 2019 at 10:16 AM Kurt Young <[hidden email]
>> >>>     <mailto:[hidden email]>> wrote:
>> >>>     >
>> >>>     > > Hi Bowen,
>> >>>     > >
>> >>>     > > Thanks for bringing this up. We actually have discussed about
>> >>>     this, and I
>> >>>     > > think Till and George have
>> >>>     > > already spend sometime investigating it. I have cced both of
>> >>>     them, and
>> >>>     > > maybe they can share
>> >>>     > > their findings.
>> >>>     > >
>> >>>     > > Best,
>> >>>     > > Kurt
>> >>>     > >
>> >>>     > >
>> >>>     > > On Tue, Jun 25, 2019 at 10:08 AM Jark Wu <[hidden email]
>> >>>     <mailto:[hidden email]>> wrote:
>> >>>     > >
>> >>>     > >> Hi Bowen,
>> >>>     > >>
>> >>>     > >> Thanks for bringing this. We also suffered from the long
>> >>>     build time.
>> >>>     > >> I agree that we should focus on solving build capacity
>> >>>     problem in the
>> >>>     > >> thread.
>> >>>     > >>
>> >>>     > >> My observation is there is only one build is running, all the
>> >>>     others
>> >>>     > >> (other
>> >>>     > >> PRs, master) are pending.
>> >>>     > >> The pricing plan[1] of travis shows it can support concurrent
>> >>>     build
>> >>>     > jobs.
>> >>>     > >> But I don't know which plan we are using, might be the free
>> >>>     plan for
>> >>>     > open
>> >>>     > >> source.
>> >>>     > >>
>> >>>     > >> I cc-ed Chesnay who may have some experience on Travis.
>> >>>     > >>
>> >>>     > >> Regards,
>> >>>     > >> Jark
>> >>>     > >>
>> >>>     > >> [1]: https://travis-ci.com/plans
>> >>>     > >>
>> >>>     > >> On Tue, 25 Jun 2019 at 08:11, Bowen Li <[hidden email]
>> >>>     <mailto:[hidden email]>> wrote:
>> >>>     > >>
>> >>>     > >> > Hi Steven,
>> >>>     > >> >
>> >>>     > >> > I think you may not read what I wrote. The discussion is about
>> >>>     > "unstable
>> >>>     > >> > build **capacity**", in another word "unstable / lack of build
>> >>>     > >> resources",
>> >>>     > >> > not "unstable build".
>> >>>     > >> >
>> >>>     > >> > On Mon, Jun 24, 2019 at 4:40 PM Steven Wu
>> >>>     <[hidden email] <mailto:[hidden email]>>
>> >>>     > wrote:
>> >>>     > >> >
>> >>>     > >> > > long and sometimes unstable build is definitely a pain
>> >> point.
>> >>>     > >> > >
>> >>>     > >> > > I suspect the build failure here in flink-connector-kafka
>> >>>     is not
>> >>>     > >> related
>> >>>     > >> > to
>> >>>     > >> > > my change. but there is no easy re-run the build on
>> >>>     travis UI.
>> >>>     > Google
>> >>>     > >> > > search showed a trick of close-and-open the PR will
>> >>>     trigger rebuild.
>> >>>     > >> but
>> >>>     > >> > > that could add noises to the PR activities.
>> >>>     > >> > > https://travis-ci.org/apache/flink/jobs/545555519
>> >>>     > >> > >
>> >>>     > >> > > travis-ci for my personal repo often failed with
>> >>>     exceeding time
>> >>>     > limit
>> >>>     > >> > after
>> >>>     > >> > > 4+ hours.
>> >>>     > >> > > The job exceeded the maximum time limit for jobs, and has
>> >>>     been
>> >>>     > >> > terminated.
>> >>>     > >> > >
>> >>>     > >> > > On Mon, Jun 24, 2019 at 4:15 PM Bowen Li
>> >>>     <[hidden email] <mailto:[hidden email]>>
>> >>>     > wrote:
>> >>>     > >> > >
>> >>>     > >> > > > https://travis-ci.org/apache/flink/builds/549681530
>> >>>     This build
>> >>>     > >> > request
>> >>>     > >> > > > has
>> >>>     > >> > > > been sitting at **HEAD of the queue** since I first saw
>> >>>     it at PST
>> >>>     > >> > 10:30am
>> >>>     > >> > > > (not sure how long it's been there before 10:30am).
>> >>>     It's PST
>> >>>     > 4:12pm
>> >>>     > >> now
>> >>>     > >> > > and
>> >>>     > >> > > > it hasn't started yet.
>> >>>     > >> > > >
>> >>>     > >> > > > On Mon, Jun 24, 2019 at 2:48 PM Bowen Li
>> >>>     <[hidden email] <mailto:[hidden email]>>
>> >>>     > >> wrote:
>> >>>     > >> > > >
>> >>>     > >> > > > > Hi devs,
>> >>>     > >> > > > >
>> >>>     > >> > > > > I've been experiencing the pain resulting from the lack of
>> >>>     > >> > > > > stable build capacity on Travis for Flink PRs [1].
>> >>>     > >> > > > > Specifically, I have often noticed that no build in the
>> >>>     > >> > > > > queue makes any progress for hours, and then suddenly 5 or
>> >>>     > >> > > > > 6 builds kick off all together after the long pause. I'm in
>> >>>     > >> > > > > the PST (UTC-08) time zone, and I've seen the pause last as
>> >>>     > >> > > > > long as 6 hours, from 9am to 3pm PST (let alone the time
>> >>>     > >> > > > > needed to drain the queue afterwards).
>> >>>     > >> > > > >
>> >>>     > >> > > > > I think this has greatly impacted our productivity. I've
>> >>>     > >> > > > > experienced that PRs submitted in the early morning PST
>> >>>     > >> > > > > won't finish their builds until late at night the same day.
>> >>>     > >> > > > >
>> >>>     > >> > > > > So my questions are:
>> >>>     > >> > > > >
>> >>>     > >> > > > > - Has anyone else experienced the same problem or made a
>> >>>     > >> > > > > similar observation on TravisCI? (I suspect it has
>> >>>     > >> > > > > something to do with the time zone)
>> >>>     > >> > > > >
>> >>>     > >> > > > > - What pricing plan of TravisCI is Flink currently using?
>> >>>     > >> > > > > Is it the free plan for open source projects? What is the
>> >>>     > >> > > > > guaranteed build capacity of the current plan?
>> >>>     > >> > > > >
>> >>>     > >> > > > > - If the current pricing plan (either free or paid) can't
>> >>>     > >> > > > > provide stable build capacity, can we upgrade to a
>> >>>     > >> > > > > higher-priced plan with larger and more stable build
>> >>>     > >> > > > > capacity?
>> >>>     > >> > > > >
>> >>>     > >> > > > > BTW, another factor that contributes to the productivity
>> >>>     > >> > > > > problem is that our build is slow - we run a full build for
>> >>>     > >> > > > > every PR and a successful full build takes ~5h. We
>> >>>     > >> > > > > definitely have more options to solve it, for instance,
>> >>>     > >> > > > > modularizing the build graph and reusing artifacts from the
>> >>>     > >> > > > > previous build. But I think that can be a big effort which
>> >>>     > >> > > > > is much harder to accomplish in a short period of time and
>> >>>     > >> > > > > may deserve its own separate discussion.
>> >>>     > >> > > > >
>> >>>     > >> > > > > [1] https://travis-ci.org/apache/flink/pull_requests
>> >>>     > >> > > > >
>> >>>     > >> > > > >
>> >>>     > >> > > >
>> >>>     > >> > >
>> >>>     > >> >
>> >>>     > >>
>> >>>     > >
>> >>>     >
>> >>>
>> >>>
>> >>>     --
>> >>>     Best Regards
>> >>>
>> >>>     Jeff Zhang
>> >>>
>> >>
>>

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

Chesnay Schepler-3
See https://issues.apache.org/jira/browse/INFRA-18533 for the overall
degradation of Travis capacity.

On 26/06/2019 21:50, Bowen wrote:

> To elaborate a bit more on why a slow build is OK but having no resources is not: say I submit a build request at 9am PST, no other requests exist, and mine is at the head of the queue. Currently, it still cannot get built until 4 or 5pm.
>
>
>
>> On Jun 26, 2019, at 12:28, Bowen Li <[hidden email]> wrote:
>>
>> Hi,
>>
>> @Dawid, I think the "long test running" issue, as I mentioned in the first email and as you said, belongs to "a big effort which is much harder to accomplish in a short period of time and may deserve its own separate discussion". Thus I didn't include it in what we can do in the foreseeable short term.
>>
>> Besides, I don't think that's the ultimate reason for the lack of build resources. Even if the build is shortened to something like 2h, the problem I described, where no build machine works for 6 or more hours during PST daytime, will still happen, because no machine from ASF INFRA's pool is allocated to Flink. I have paid close attention to the build queue over the past few weekdays, and it's a pretty clear pattern now.
>>
>> **The ultimate root cause** for that is - we don't have any **dedicated** build resources that we can reliably count on. I'm actually OK with waiting a long time if build requests are running, since it means we are at least making progress. But I'm not OK with having no build resources. A better state to aim for in the short term is to always have at least a central pool of machines (say 3 or 5) dedicated to building Flink at any time, or maybe to use users' resources.
>>
>> @Chesnay @Robert I synced with Jeff offline: the Zeppelin community is using a Jenkins job to automatically build on users' Travis accounts and link the result back to the GitHub PR. I guess the Jenkins job would fetch the latest upstream master and build the PR against it. Jeff has filed tickets to learn about and get access to the Jenkins infra. It'd be better to fully understand it before judging this approach.
>>
>> I also heard good things about CircleCI, and ASF INFRA seems to have a pool of build capacity there too. It could be an alternative to consider.
>>
>>> On Wed, Jun 26, 2019 at 12:44 AM Dawid Wysakowicz <[hidden email]> wrote:
>>> Sorry to jump in late, but I think Bowen missed the most important point
>>> from Chesnay's previous message in the summary. The ultimate reason for
>>> all the problems is that the tests take close to 2 hours to run already.
>>> I fully support this claim: "Unless people start caring about test times
>>> before adding them, this issue cannot be solved"
>>>
>>> This is also another reason why using user's Travis account won't help.
>>> Every few weeks we reach the user's time limit for a single profile.
>>> This makes the user's builds simply fail, until we either properly
>>> decrease the time the tests take (which I am not sure we ever did) or
>>> postpone the problem by splitting into more profiles. (Note that the ASF
>>> Travis account has higher time limits)
>>>
>>> Best,
>>>
>>> Dawid
>>>
>>> On 26/06/2019 09:36, Robert Metzger wrote:
>>>> Do we know if using "the best" available hardware would improve the build
>>>> times?
>>>> Imagine we would run the build on machines with plenty of main memory to
>>>> mount everything to ramdisk + the latest CPU architecture?
>>>>
>>>> Throwing hardware at the problem could help reduce the time of an
>>>> individual build, and using our own infrastructure would remove our
>>>> dependency on Apache's Travis account (with the obvious downside of having
>>>> to maintain the infrastructure)
>>>> We could use an open source travis alternative, to have a similar
>>>> experience and make the migration easy.
>>>>
>>>>
>>>> On Wed, Jun 26, 2019 at 9:34 AM Chesnay Schepler <[hidden email]> wrote:
>>>>
>>>>> From what I gathered, there's no special sauce that the Zeppelin
>>>>> project uses which actually integrates a user's Travis account into the PR.
>>>>>
>>>>> They just disabled Travis for PRs. And that's kind of it.
>>>>>
>>>>> Naturally we can do this (duh) and save the ASF a fair amount of
>>>>> resources, but there are downsides:
>>>>>
>>>>> The discoverability of the Travis check takes a nose-dive. Either we
>>>>> require every contributor to always, on every commit, also post a Travis
>>>>> build, or we have the reviewer sift through the contributor's account to
>>>>> find it.
>>>>>
>>>>> This is rather cumbersome. Additionally, it's also not equivalent to
>>>>> having a PR build.
>>>>>
>>>>> A normal branch build takes a branch as is and tests it. A PR build
>>>>> merges the branch into master, and then runs it. (Fun fact: This is why
>>>>> a PR with merge conflicts is not being run on Travis.)
>>>>>
>>>>> And ultimately, everyone can already make use of this approach anyway.
>>>>>
>>>>> On 25/06/2019 08:02, Jark Wu wrote:
>>>>>> Hi Jeff,
>>>>>>
>>>>>> Thanks for sharing the Zeppelin approach. I think it's a good idea to
>>>>>> leverage user's travis account.
>>>>>> In this way, we can have almost unlimited concurrent build jobs and
>>>>>> developers can restart build by themselves (currently only committers
>>>>>> can restart PR's build).
>>>>>>
>>>>>> But I'm still not very clear how to integrate user's travis build into
>>>>>> the Flink pull request's build automatically. Can you explain more in
>>>>>> detail?
>>>>>>
>>>>>> Another question: does Travis only build branches for user accounts?
>>>>>> My concern is that builds for PRs rebase the user's commits onto the
>>>>>> current master branch, which helps us find problems before merging,
>>>>>> whereas builds for branches lose the impact of new commits in master.
>>>>>> How does Zeppelin solve this problem?
>>>>>>
>>>>>> Thanks again for sharing the idea.
>>>>>>
>>>>>> Regards,
>>>>>> Jark
>>>>>>
>>>>>> On Tue, 25 Jun 2019 at 11:01, Jeff Zhang <[hidden email]> wrote:
>>>>>>
>>>>>>      Hi Folks,
>>>>>>
>>>>>>      Zeppelin met this kind of issue before. We solved it by delegating
>>>>>>      each contributor's PR build to their own Travis account (everyone
>>>>>>      can have 5 free slots for Travis builds).
>>>>>>      The Apache account's Travis build is only triggered when a PR is merged.
>>>>>>
>>>>>>      On Tue, Jun 25, 2019 at 10:16 AM Kurt Young <[hidden email]> wrote:
>>>>>>
>>>>>>      > (Forgot to cc George)
>>>>>>      >
>>>>>>      > Best,
>>>>>>      > Kurt
>>>>>>      >
>>>>>>      >
>>>>>>      > On Tue, Jun 25, 2019 at 10:16 AM Kurt Young <[hidden email]> wrote:
>>>>>>      >
>>>>>>      > > Hi Bowen,
>>>>>>      > >
>>>>>>      > > Thanks for bringing this up. We have actually discussed this, and I
>>>>>>      > > think Till and George have already spent some time investigating it.
>>>>>>      > > I have cc'ed both of them, and maybe they can share their findings.
>>>>>>      > >
>>>>>>      > > Best,
>>>>>>      > > Kurt
>>>>>>      > >
>>>>>>      > >
>>>>>>      > > On Tue, Jun 25, 2019 at 10:08 AM Jark Wu <[hidden email]> wrote:
>>>>>>      > >
>>>>>>      > >> Hi Bowen,
>>>>>>      > >>
>>>>>>      > >> Thanks for bringing this up. We have also suffered from the long
>>>>>>      > >> build time. I agree that we should focus on solving the build
>>>>>>      > >> capacity problem in this thread.
>>>>>>      > >>
>>>>>>      > >> My observation is that only one build is running while all the
>>>>>>      > >> others (other PRs, master) are pending.
>>>>>>      > >> The pricing plan [1] of Travis shows it can support concurrent build
>>>>>>      > >> jobs, but I don't know which plan we are using; it might be the free
>>>>>>      > >> plan for open source.
>>>>>>      > >>
>>>>>>      > >> I cc'ed Chesnay, who may have some experience with Travis.
>>>>>>      > >>
>>>>>>      > >> Regards,
>>>>>>      > >> Jark
>>>>>>      > >>
>>>>>>      > >> [1]: https://travis-ci.com/plans
>>>>>>      > >>
>>>>>>      > >> On Tue, 25 Jun 2019 at 08:11, Bowen Li <[hidden email]> wrote:
>>>>>>      > >>
>>>>>>      > >> > Hi Steven,
>>>>>>      > >> >
>>>>>>      > >> > I think you may not have read what I wrote. The discussion is about
>>>>>>      > >> > "unstable build **capacity**", in other words "unstable / lack of
>>>>>>      > >> > build resources", not "unstable build".
>>>>>>      > >> >
>>>>>>      > >> > On Mon, Jun 24, 2019 at 4:40 PM Steven Wu <[hidden email]> wrote:
>>>>>>      > >> >
>>>>>>      > >> > > A long and sometimes unstable build is definitely a pain point.
>>>>>>      > >> > >
>>>>>>      > >> > > I suspect the build failure here in flink-connector-kafka is not
>>>>>>      > >> > > related to my change, but there is no easy way to re-run the build
>>>>>>      > >> > > in the Travis UI. A Google search showed a trick: closing and
>>>>>>      > >> > > reopening the PR will trigger a rebuild, but that could add noise
>>>>>>      > >> > > to the PR activity.
>>>>>>      > >> > > https://travis-ci.org/apache/flink/jobs/545555519
>>>>>>      > >> > >
>>>>>>      > >> > > Travis CI for my personal repo often failed with an exceeded time
>>>>>>      > >> > > limit after 4+ hours:
>>>>>>      > >> > > "The job exceeded the maximum time limit for jobs, and has been
>>>>>>      > >> > > terminated."
>>>>>>      > >> > >
>>>>>>      > >> > > On Mon, Jun 24, 2019 at 4:15 PM Bowen Li <[hidden email]> wrote:
>>>>>>      > >> > >
>>>>>>      > >> > > > https://travis-ci.org/apache/flink/builds/549681530 This build
>>>>>>      > >> > > > request has been sitting at **HEAD of the queue** since I first
>>>>>>      > >> > > > saw it at 10:30am PST (not sure how long it had been there
>>>>>>      > >> > > > before 10:30am). It's 4:12pm PST now and it hasn't started yet.
>>>>>>      > >> > > >
>>>>>>      > >> > > > On Mon, Jun 24, 2019 at 2:48 PM Bowen Li <[hidden email]> wrote:
>>>>>>      > >> > > >
>>>>>>      > >> > > > > Hi devs,
>>>>>>      > >> > > > >
>>>>>>      > >> > > > > I've been experiencing the pain resulting from the lack of
>>>>>>      > >> > > > > stable build capacity on Travis for Flink PRs [1].
>>>>>>      > >> > > > > Specifically, I have often noticed that no build in the
>>>>>>      > >> > > > > queue makes any progress for hours, and then suddenly 5 or
>>>>>>      > >> > > > > 6 builds kick off all together after the long pause. I'm in
>>>>>>      > >> > > > > the PST (UTC-08) time zone, and I've seen the pause last as
>>>>>>      > >> > > > > long as 6 hours, from 9am to 3pm PST (let alone the time
>>>>>>      > >> > > > > needed to drain the queue afterwards).
>>>>>>      > >> > > > >
>>>>>>      > >> > > > > I think this has greatly impacted our productivity. I've
>>>>>>      > >> > > > > experienced that PRs submitted in the early morning PST
>>>>>>      > >> > > > > won't finish their builds until late at night the same day.
>>>>>>      > >> > > > >
>>>>>>      > >> > > > > So my questions are:
>>>>>>      > >> > > > >
>>>>>>      > >> > > > > - Has anyone else experienced the same problem or made a
>>>>>>      > >> > > > > similar observation on TravisCI? (I suspect it has
>>>>>>      > >> > > > > something to do with the time zone)
>>>>>>      > >> > > > >
>>>>>>      > >> > > > > - What pricing plan of TravisCI is Flink currently using?
>>>>>>      > >> > > > > Is it the free plan for open source projects? What is the
>>>>>>      > >> > > > > guaranteed build capacity of the current plan?
>>>>>>      > >> > > > >
>>>>>>      > >> > > > > - If the current pricing plan (either free or paid) can't
>>>>>>      > >> > > > > provide stable build capacity, can we upgrade to a
>>>>>>      > >> > > > > higher-priced plan with larger and more stable build
>>>>>>      > >> > > > > capacity?
>>>>>>      > >> > > > >
>>>>>>      > >> > > > > BTW, another factor that contributes to the productivity
>>>>>>      > >> > > > > problem is that our build is slow - we run a full build for
>>>>>>      > >> > > > > every PR and a successful full build takes ~5h. We
>>>>>>      > >> > > > > definitely have more options to solve it, for instance,
>>>>>>      > >> > > > > modularizing the build graph and reusing artifacts from the
>>>>>>      > >> > > > > previous build. But I think that can be a big effort which
>>>>>>      > >> > > > > is much harder to accomplish in a short period of time and
>>>>>>      > >> > > > > may deserve its own separate discussion.
>>>>>>      > >> > > > >
>>>>>>      > >> > > > > [1] https://travis-ci.org/apache/flink/pull_requests
>>>>>>      > >> > > > >
>>>>>>      > >> > > > >
>>>>>>      > >> > > >
>>>>>>      > >> > >
>>>>>>      > >> >
>>>>>>      > >>
>>>>>>      > >
>>>>>>      >
>>>>>>
>>>>>>
>>>>>>      --
>>>>>>      Best Regards
>>>>>>
>>>>>>      Jeff Zhang
>>>>>>
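
The branch-build versus PR-build distinction in the quoted text above can be tried out locally. The following is a minimal sketch, not anything taken from the Flink or Zeppelin setup: GitHub publishes `refs/pull/<N>/head` (the branch as pushed) and, for mergeable PRs, `refs/pull/<N>/merge` (a trial merge into the base branch). The helper names `pr_ref` and `print_checkout`, the remote name `origin`, and the PR number in the usage note are assumptions for illustration. The functions only print the git commands so that nothing is executed by accident.

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only.

# pr_ref KIND NUMBER
# Prints the GitHub ref for a pull request.
#   KIND=head  -> the contributor's branch as pushed (what a branch build sees)
#   KIND=merge -> GitHub's trial merge into the base branch (what a PR build sees)
pr_ref() {
  case "$1" in
    head|merge) printf 'refs/pull/%s/%s\n' "$2" "$1" ;;
    *) echo "unknown kind: $1" >&2; return 1 ;;
  esac
}

# print_checkout KIND NUMBER
# Prints (does not run) the commands that would check out that variant,
# assuming the remote "origin" points at the upstream GitHub repository.
print_checkout() {
  ref=$(pr_ref "$1" "$2") || return 1
  printf 'git fetch origin %s\n' "$ref"
  printf 'git checkout -qf FETCH_HEAD\n'
}
```

For example, `print_checkout merge 1234` prints the two commands for the merged state of a hypothetical PR 1234, and `print_checkout head 1234` prints them for the unmerged branch. A PR with merge conflicts has no usable merge ref, which matches the observation that such PRs are not built.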


Re: [DISCUSS] solve unstable build capacity problem on TravisCI

Chesnay Schepler-3
In reply to this post by bowen.li
Does this imply that a Jenkins job is active as long as the Travis build
runs?

On 26/06/2019 21:28, Bowen Li wrote:

> Hi,
>
> @Dawid, I think the "long test running" issue, as I mentioned in the first
> email and as you said, belongs to "a big effort which is much harder to
> accomplish in a short period of time and may deserve its own separate
> discussion". Thus I didn't include it in what we can do in the foreseeable
> short term.
>
> Besides, I don't think that's the ultimate reason for the lack of build
> resources. Even if the build is shortened to something like 2h, the problem
> I described, where no build machine works for 6 or more hours during PST
> daytime, will still happen, because no machine from ASF INFRA's pool is
> allocated to Flink. I have paid close attention to the build queue over the
> past few weekdays, and it's a pretty clear pattern now.
>
> **The ultimate root cause** for that is - we don't have any **dedicated**
> build resources that we can reliably count on. I'm actually OK with waiting
> a long time if build requests are running, since it means we are at least
> making progress. But I'm not OK with having no build resources. A better
> state to aim for in the short term is to always have at least a central
> pool of machines (say 3 or 5) dedicated to building Flink at any time, or
> maybe to use users' resources.
>
> @Chesnay @Robert I synced with Jeff offline: the Zeppelin community is
> using a Jenkins job to automatically build on users' Travis accounts and
> link the result back to the GitHub PR. I guess the Jenkins job would fetch
> the latest upstream master and build the PR against it. Jeff has filed
> tickets to learn about and get access to the Jenkins infra. It'd be better
> to fully understand it before judging this approach.
>
> I also heard good things about CircleCI, and ASF INFRA seems to have a pool
> of build capacity there too. It could be an alternative to consider.


Re: [DISCUSS] solve unstable build capacity problem on TravisCI

Jeff Zhang
Here's what the Zeppelin community did: we wrote a Python script to check
the build status of a pull request.
Here's the script:
https://github.com/apache/zeppelin/blob/master/travis_check.py

And this is the script we use in the Jenkins build job:

if [ -f "travis_check.py" ]; then
  git log -n 1
  STATUS=$(curl -s $BUILD_URL | grep -e "GitHub pull request.*from.*" |
    sed 's/.*GitHub pull request <a href=\"\(https[^"]*\).*from[^"]*.\(https[^"]*\).*/\1 \2/g')
  AUTHOR=$(echo $STATUS | sed 's/.*[/]\(.*\)$/\1/g')
  PR=$(echo $STATUS | awk '{print $1}' | sed 's/.*[/]\(.*\)$/\1/g')
  #COMMIT=$(git log -n 1 | grep "^Merge:" | awk '{print $3}')
  #if [ -z $COMMIT ]; then
  #  COMMIT=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR |
  #    grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | tr '\n' ' ' |
  #    sed 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' | grep -v "apache:" |
  #    sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g')
  #fi

  # get commit hash from PR
  COMMIT=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR |
    grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | tr '\n' ' ' |
    sed 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' | grep -v "apache:" |
    sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g')
  sleep 30 # wait a moment for travis to start the build
  RET_CODE=0
  python ./travis_check.py ${AUTHOR} ${COMMIT} || RET_CODE=$?
  # try with repository name when travis-ci is not available in the account
  if [ $RET_CODE -eq 2 ]; then
    RET_CODE=0
    AUTHOR=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR |
      grep '"full_name":' | grep -v "apache/zeppelin" |
      sed 's/.*[:][^"]*["]\([^/]*\).*/\1/g')
    python ./travis_check.py ${AUTHOR} ${COMMIT} || RET_CODE=$?
  fi

  # fail: can't find build information in travis
  if [ $RET_CODE -eq 2 ]; then
    set +x
    echo "-----------------------------------------------------"
    echo "Looks like travis-ci is not configured for your fork."
    echo "Please set it up by switching on the 'zeppelin' repository at https://travis-ci.org/profile and travis-ci."
    echo "And then make sure 'Build branch updates' option is enabled in the settings https://travis-ci.org/${AUTHOR}/zeppelin/settings."
    echo ""
    echo "To trigger CI after setup, you will need to amend your last commit with"
    echo "git commit --amend"
    echo "git push your-remote HEAD --force"
    echo ""
    echo "See http://zeppelin.apache.org/contribution/contributions.html#continuous-integration ."
  fi

  exit $RET_CODE
else
  set +x
  echo "travis_check.py does not exist"
  exit 1
fi
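
The retry logic in the script above (run the check for the first account, and
on exit code 2 try once more with another account) can be isolated into a
small function. This is only a sketch: check_with_fallback and its argument
order are made up for illustration, and the only thing taken from the script
is that exit code 2 means "no build information found".

```shell
# Sketch only: run a check for one account and, if it reports
# "no build information" (exit code 2, as in the script above),
# retry once with a fallback account.
# Usage: check_with_fallback <check command> <commit> <author> <fallback author>
# <check command> is called as: <check command> <account> <commit>
check_with_fallback() {
  check_cmd=$1 commit=$2 author=$3 fallback=$4
  ret_code=0
  # "|| ret_code=$?" records the failure without aborting under "set -e"
  "$check_cmd" "$author" "$commit" || ret_code=$?
  if [ "$ret_code" -eq 2 ] && [ -n "$fallback" ]; then
    ret_code=0
    "$check_cmd" "$fallback" "$commit" || ret_code=$?
  fi
  return "$ret_code"
}
```

To use it with the Python script, the check command would be a one-line
wrapper such as run_check() { python ./travis_check.py "$1" "$2"; }, where
run_check is again a hypothetical name.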

On Sat, Jun 29, 2019 at 3:17 PM Chesnay Schepler <[hidden email]> wrote:

> Does this imply that a Jenkins job is active as long as the Travis build
> runs?

--
Best Regards

Jeff Zhang

Re: [DISCUSS] solve unstable build capacity problem on TravisCI

Chesnay Schepler-3
So yes, the Jenkins job keeps polling the state from Travis until the build
finishes.

Not sure I'm comfortable with the idea of using Jenkins workers just to
idle for several hours.
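
For illustration, a wait like this boils down to a polling loop. The sketch
below is not taken from travis_check.py: wait_for_build, the state names, the
interval, and the attempt cap are all assumptions, and the command that
prints the current build state is left to the caller. While the loop sleeps,
the Jenkins executor running it is occupied and does no other work.

```shell
# Sketch only: poll a build state until it is terminal or we give up.
# The state names and both defaults below are assumptions for illustration.
POLL_INTERVAL=${POLL_INTERVAL:-300}  # seconds between checks
MAX_ATTEMPTS=${MAX_ATTEMPTS:-72}     # 72 checks at 300 s spacing is roughly 6 hours

# Usage: wait_for_build <command that prints the current build state>
# Returns 0 if the build passed, 1 if it failed, 124 if it never finished.
wait_for_build() {
  attempt=1
  while [ "$attempt" -le "$MAX_ATTEMPTS" ]; do
    state=$("$@")
    case "$state" in
      passed) return 0 ;;
      failed|errored|canceled) return 1 ;;
    esac
    # Still queued or running: the executor does nothing but wait here.
    if [ "$attempt" -lt "$MAX_ATTEMPTS" ]; then
      sleep "$POLL_INTERVAL"
    fi
    attempt=$((attempt + 1))
  done
  return 124
}
```

The cap matters for the concern above: without it, a build that never starts
would hold the executor indefinitely. The 124 return value only mirrors the
exit status GNU timeout uses when a command times out.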

On 29/06/2019 14:56, Jeff Zhang wrote:

> Here's what zeppelin community did, we make a python script to check the
> build status of pull request.
> Here's script:
> https://github.com/apache/zeppelin/blob/master/travis_check.py
