By looking at the git history of the Jenkins script, its core part was
finished in March 2017 (with only two minor updates in 2017/2018), so it has
been running for over two years now, and it feels like the Zeppelin community
has been quite happy with it. @Jeff Zhang <[hidden email]> can you share
your insights and user experience with the Jenkins+Travis approach?

Things like:

- has the approach completely solved the resource capacity problem for the
  Zeppelin community? is the Zeppelin community happy with the result?
- is the whole configuration chain stable enough (e.g. uptime)?
- how often do you need to maintain the Jenkins infra? how many people are
  usually involved in maintenance and bug fixes?

The downside of this approach seems to me to be mostly the maintenance:
maintaining the script and the Jenkins infra.

** Having Our Own Travis-CI.com Account **

Another alternative I've been thinking of is to have our own travis-ci.com
account with paid dedicated resources. Note that travis-ci.org is the free
version and travis-ci.com is the commercial version. We currently use a shared
resource pool managed by the ASF INFRA team on travis-ci.org, but we have no
control over it - we can't see how it's configured, how many resources are
available, how resources are allocated among Apache projects, etc. The nice
things about having an account on travis-ci.com are:

- relatively low cost with a much better resource guarantee than what we
  currently have [1]: $249/month for 5 dedicated concurrent jobs, $489/month
  for 10
- low maintenance work compared to using Jenkins
- (potentially) no migration cost according to Travis's doc [2] (pending
  verification)
- full control over the build capacity/configuration compared to using ASF
  INFRA's pool

I'd be surprised if we, as such a vibrant community, cannot find and fund
$249*12=$2988 a year in exchange for a much better developer experience and
much higher productivity.

[1] https://travis-ci.com/plans
[2] https://docs.travis-ci.com/user/migrate/open-source-repository-migration

On Sat, Jun 29, 2019 at 8:39 AM Chesnay Schepler <[hidden email]> wrote:

> So yes, the Jenkins job keeps pulling the state from Travis until it
> finishes.
>
> Not sure I'm comfortable with the idea of using Jenkins workers just to
> idle for several hours.
>
> On 29/06/2019 14:56, Jeff Zhang wrote:
> > Here's what the Zeppelin community did: we made a Python script to check
> > the build status of a pull request.
> > Here's the script:
> > https://github.com/apache/zeppelin/blob/master/travis_check.py
> >
> > And this is the script we use in the Jenkins build job.
> >
> > if [ -f "travis_check.py" ]; then
> >   git log -n 1
> >   STATUS=$(curl -s $BUILD_URL | grep -e "GitHub pull request.*from.*" | sed 's/.*GitHub pull request <a href=\"\(https[^"]*\).*from[^"]*.\(https[^"]*\).*/\1 \2/g')
> >   AUTHOR=$(echo $STATUS | sed 's/.*[/]\(.*\)$/\1/g')
> >   PR=$(echo $STATUS | awk '{print $1}' | sed 's/.*[/]\(.*\)$/\1/g')
> >   #COMMIT=$(git log -n 1 | grep "^Merge:" | awk '{print $3}')
> >   #if [ -z $COMMIT ]; then
> >   #  COMMIT=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR | grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | tr '\n' ' ' | sed 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' | grep -v "apache:" | sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g')
> >   #fi
> >
> >   # get commit hash from PR
> >   COMMIT=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR | grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | tr '\n' ' ' | sed 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' | grep -v "apache:" | sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g')
> >   sleep 30 # wait a moment for Travis to start the build
> >   RET_CODE=0
> >   python ./travis_check.py ${AUTHOR} ${COMMIT} || RET_CODE=$?
> >   if [ $RET_CODE -eq 2 ]; then # try with repository name when travis-ci is not available in the account
> >     RET_CODE=0
> >     AUTHOR=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR | grep '"full_name":' | grep -v "apache/zeppelin" | sed 's/.*[:][^"]*["]\([^/]*\).*/\1/g')
> >     python ./travis_check.py ${AUTHOR} ${COMMIT} || RET_CODE=$?
> >   fi
> >
> >   if [ $RET_CODE -eq 2 ]; then # fail: can't find build information in Travis
> >     set +x
> >     echo "-----------------------------------------------------"
> >     echo "Looks like travis-ci is not configured for your fork."
> >     echo "Please set it up by switching on the 'zeppelin' repository at https://travis-ci.org/profile and travis-ci."
> >     echo "And then make sure the 'Build branch updates' option is enabled in the settings https://travis-ci.org/${AUTHOR}/zeppelin/settings."
> >     echo ""
> >     echo "To trigger CI after setup, you will need to amend your last commit with"
> >     echo "git commit --amend"
> >     echo "git push your-remote HEAD --force"
> >     echo ""
> >     echo "See http://zeppelin.apache.org/contribution/contributions.html#continuous-integration ."
> >   fi
> >
> >   exit $RET_CODE
> > else
> >   set +x
> >   echo "travis_check.py does not exist"
> >   exit 1
> > fi
> >
> > Chesnay Schepler <[hidden email]> wrote on Sat, Jun 29, 2019 at 3:17 PM:
> >
> >> Does this imply that a Jenkins job is active as long as the Travis build
> >> runs?
> >>
> >> On 26/06/2019 21:28, Bowen Li wrote:
> >>> Hi,
> >>>
> >>> @Dawid, I think the "long test running" issue that I mentioned in the first
> >>> email, as you guys also said, belongs to "a big effort which is much harder
> >>> to accomplish in a short period of time and may deserve its own separate
> >>> discussion". Thus I didn't include it in what we can do in the foreseeable
> >>> short term.
> >>>
> >>> Besides, I don't think that's the ultimate reason for the lack of build
> >>> resources.
> >>> Even if the build is shortened to something like 2h, the problem I
> >>> described, of no build machine working for 6 or more hours in PST daytime,
> >>> will still happen, because no machine from ASF INFRA's pool is allocated to
> >>> Flink. As I have paid close attention to the build queue over the past few
> >>> weekdays, it's a pretty clear pattern now.
> >>>
> >>> **The ultimate root cause** for that is - we don't have any **dedicated**
> >>> build resources that we can stably rely on. I'm actually ok to wait for a
> >>> long time if there are build requests running; it means at least we are
> >>> making progress. But I'm not ok with no build resources. A better place I
> >>> think we should aim at in the short term is to always have at least a
> >>> central pool (can be 3 or 5) of machines dedicated to building Flink at any
> >>> time, or maybe to use users' resources.
> >>>
> >>> @Chesnay @Robert I synced with Jeff offline: the Zeppelin community is
> >>> using a Jenkins job to automatically build on users' Travis accounts and
> >>> link the result back to the GitHub PR. I guess the Jenkins job would fetch
> >>> the latest upstream master and build the PR against it. Jeff has filed
> >>> tickets to learn about and get access to the Jenkins infra. It'll be better
> >>> to fully understand it first before judging this approach.
> >>>
> >>> I also heard good things about CircleCI, and ASF INFRA seems to have a
> >>> pool of build capacity there too. It can be an alternative to consider.
> >>>
> >>> On Wed, Jun 26, 2019 at 12:44 AM Dawid Wysakowicz <[hidden email]>
> >>> wrote:
> >>>
> >>>> Sorry to jump in late, but I think Bowen missed the most important point
> >>>> from Chesnay's previous message in the summary. The ultimate reason for
> >>>> all the problems is that the tests take close to 2 hours to run already.
> >>>> I fully support this claim: "Unless people start caring about test times
> >>>> before adding them, this issue cannot be solved"
> >>>>
> >>>> This is also another reason why using users' Travis accounts won't help.
> >>>> Every few weeks we reach the user's time limit for a single profile.
> >>>> This makes the user's builds simply fail, until we either properly
> >>>> decrease the time the tests take (which I am not sure we ever did) or
> >>>> postpone the problem by splitting into more profiles. (Note that the ASF
> >>>> Travis account has higher time limits)
> >>>>
> >>>> Best,
> >>>>
> >>>> Dawid
> >>>>
> >>>> On 26/06/2019 09:36, Robert Metzger wrote:
> >>>>> Do we know if using "the best" available hardware would improve the
> >>>>> build times?
> >>>>> Imagine we would run the build on machines with plenty of main memory to
> >>>>> mount everything to ramdisk + the latest CPU architecture?
> >>>>>
> >>>>> Throwing hardware at the problem could help reduce the time of an
> >>>>> individual build, and using our own infrastructure would remove our
> >>>>> dependency on Apache's Travis account (with the obvious downside of
> >>>>> having to maintain the infrastructure).
> >>>>> We could use an open source Travis alternative, to have a similar
> >>>>> experience and make the migration easy.
> >>>>>
> >>>>> On Wed, Jun 26, 2019 at 9:34 AM Chesnay Schepler <[hidden email]> wrote:
> >>>>>> From what I gathered, there's no special sauce that the Zeppelin
> >>>>>> project uses which actually integrates a user's Travis account into the
> >>>>>> PR. They just disabled Travis for PRs. And that's kind of it.
> >>>>>>
> >>>>>> Naturally we can do this (duh) and save the ASF a fair amount of
> >>>>>> resources, but there are downsides:
> >>>>>>
> >>>>>> The discoverability of the Travis check takes a nose-dive.
> >>>>>> Either we require every contributor to always, on every commit, also post
> >>>>>> a Travis build, or we have the reviewer sift through the contributor's
> >>>>>> account to find it.
> >>>>>>
> >>>>>> This is rather cumbersome. Additionally, it's also not equivalent to
> >>>>>> having a PR build.
> >>>>>>
> >>>>>> A normal branch build takes a branch as is and tests it. A PR build
> >>>>>> merges the branch into master, and then runs it. (Fun fact: This is why
> >>>>>> a PR with merge conflicts is not being run on Travis.)
> >>>>>>
> >>>>>> And ultimately, everyone can already make use of this approach anyway.
> >>>>>>
> >>>>>> On 25/06/2019 08:02, Jark Wu wrote:
> >>>>>>> Hi Jeff,
> >>>>>>>
> >>>>>>> Thanks for sharing the Zeppelin approach. I think it's a good idea to
> >>>>>>> leverage users' Travis accounts.
> >>>>>>> In this way, we can have almost unlimited concurrent build jobs, and
> >>>>>>> developers can restart builds by themselves (currently only committers
> >>>>>>> can restart a PR's build).
> >>>>>>>
> >>>>>>> But I'm still not very clear on how to integrate a user's Travis build
> >>>>>>> into the Flink pull request's build automatically. Can you explain in
> >>>>>>> more detail?
> >>>>>>>
> >>>>>>> Another question: does Travis only build branches for user accounts?
> >>>>>>> My concern is this: builds for PRs rebase the user's commits against the
> >>>>>>> current master branch, which helps us find problems before merging.
> >>>>>>> Builds for branches will lose the impact of new commits in master.
> >>>>>>> How does Zeppelin solve this problem?
> >>>>>>>
> >>>>>>> Thanks again for sharing the idea.
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Jark
> >>>>>>>
> >>>>>>> On Tue, 25 Jun 2019 at 11:01, Jeff Zhang <[hidden email]> wrote:
> >>>>>>>
> >>>>>>>     Hi Folks,
> >>>>>>>
> >>>>>>>     Zeppelin met this kind of issue before; we solved it by delegating
> >>>>>>>     each person's PR build to their own Travis account (everyone can have
> >>>>>>>     5 free slots for Travis builds).
> >>>>>>>     The Apache account's Travis build is only triggered when a PR is merged.
> >>>>>>>
> >>>>>>>     Kurt Young <[hidden email]> wrote on Tue, Jun 25, 2019 at 10:16 AM:
> >>>>>>>
> >>>>>>>     > (Forgot to cc George)
> >>>>>>>     >
> >>>>>>>     > Best,
> >>>>>>>     > Kurt
> >>>>>>>     >
> >>>>>>>     > On Tue, Jun 25, 2019 at 10:16 AM Kurt Young <[hidden email]> wrote:
> >>>>>>>     >
> >>>>>>>     > > Hi Bowen,
> >>>>>>>     > >
> >>>>>>>     > > Thanks for bringing this up. We have actually discussed this, and I
> >>>>>>>     > > think Till and George have already spent some time investigating it.
> >>>>>>>     > > I have cc'ed both of them, and maybe they can share their findings.
> >>>>>>>     > >
> >>>>>>>     > > Best,
> >>>>>>>     > > Kurt
> >>>>>>>     > >
> >>>>>>>     > > On Tue, Jun 25, 2019 at 10:08 AM Jark Wu <[hidden email]> wrote:
> >>>>>>>     > >
> >>>>>>>     > >> Hi Bowen,
> >>>>>>>     > >>
> >>>>>>>     > >> Thanks for bringing this up. We also suffered from the long build
> >>>>>>>     > >> time. I agree that we should focus on solving the build capacity
> >>>>>>>     > >> problem in this thread.
> >>>>>>>     > >>
> >>>>>>>     > >> My observation is that only one build is running; all the others
> >>>>>>>     > >> (other PRs, master) are pending.
> >>>>>>>     > >> The pricing plan [1] of Travis shows it can support concurrent build
> >>>>>>>     > >> jobs.
> >>>>>>>     > >> But I don't know which plan we are using; it might be the free plan
> >>>>>>>     > >> for open source.
> >>>>>>>     > >>
> >>>>>>>     > >> I cc'ed Chesnay, who may have some experience with Travis.
> >>>>>>>     > >>
> >>>>>>>     > >> Regards,
> >>>>>>>     > >> Jark
> >>>>>>>     > >>
> >>>>>>>     > >> [1]: https://travis-ci.com/plans
> >>>>>>>     > >>
> >>>>>>>     > >> On Tue, 25 Jun 2019 at 08:11, Bowen Li <[hidden email]> wrote:
> >>>>>>>     > >>
> >>>>>>>     > >> > Hi Steven,
> >>>>>>>     > >> >
> >>>>>>>     > >> > I think you may not have read what I wrote. The discussion is about
> >>>>>>>     > >> > "unstable build **capacity**", in other words "unstable / lack of
> >>>>>>>     > >> > build resources", not "unstable build".
> >>>>>>>     > >> >
> >>>>>>>     > >> > On Mon, Jun 24, 2019 at 4:40 PM Steven Wu <[hidden email]> wrote:
> >>>>>>>     > >> >
> >>>>>>>     > >> > > Long and sometimes unstable builds are definitely a pain point.
> >>>>>>>     > >> > >
> >>>>>>>     > >> > > I suspect the build failure here in flink-connector-kafka is not
> >>>>>>>     > >> > > related to my change, but there is no easy way to re-run the build
> >>>>>>>     > >> > > in the Travis UI. A search showed the trick that closing and
> >>>>>>>     > >> > > reopening the PR will trigger a rebuild, but that could add noise
> >>>>>>>     > >> > > to the PR activity.
> >>>>>>>     > >> > > https://travis-ci.org/apache/flink/jobs/545555519
> >>>>>>>     > >> > >
> >>>>>>>     > >> > > travis-ci for my personal repo often failed by exceeding the time
> >>>>>>>     > >> > > limit after 4+ hours:
> >>>>>>>     > >> > > "The job exceeded the maximum time limit for jobs, and has been
> >>>>>>>     > >> > > terminated."
> >>>>>>>     > >> > >
> >>>>>>>     > >> > > On Mon, Jun 24, 2019 at 4:15 PM Bowen Li <[hidden email]> wrote:
> >>>>>>>     > >> > >
> >>>>>>>     > >> > > > https://travis-ci.org/apache/flink/builds/549681530 This build
> >>>>>>>     > >> > > > request has been sitting at the **HEAD of the queue** since I first
> >>>>>>>     > >> > > > saw it at PST 10:30am (not sure how long it had been there before
> >>>>>>>     > >> > > > 10:30am). It's PST 4:12pm now and it hasn't started yet.
> >>>>>>>     > >> > > >
> >>>>>>>     > >> > > > On Mon, Jun 24, 2019 at 2:48 PM Bowen Li <[hidden email]> wrote:
> >>>>>>>     > >> > > >
> >>>>>>>     > >> > > > > Hi devs,
> >>>>>>>     > >> > > > >
> >>>>>>>     > >> > > > > I've been experiencing the pain resulting from the lack of stable
> >>>>>>>     > >> > > > > build capacity on Travis for Flink PRs [1]. Specifically, I have
> >>>>>>>     > >> > > > > often noticed that no build in the queue makes any progress for
> >>>>>>>     > >> > > > > hours, and then suddenly 5 or 6 builds kick off all together after
> >>>>>>>     > >> > > > > the long pause. I'm in the PST (UTC-08) time zone, and I've seen
> >>>>>>>     > >> > > > > that the pause can be as long as 6 hours, from PST 9am to 3pm (let
> >>>>>>>     > >> > > > > alone the time needed to drain the queue afterwards).
> >>>>>>>     > >> > > > >
> >>>>>>>     > >> > > > > I think this has greatly impacted our productivity.
> >>>>>>>     > >> > > > > I've experienced that PRs submitted in the early morning of the PST
> >>>>>>>     > >> > > > > time zone won't finish their build until late at night of the same
> >>>>>>>     > >> > > > > day.
> >>>>>>>     > >> > > > >
> >>>>>>>     > >> > > > > So my questions are:
> >>>>>>>     > >> > > > >
> >>>>>>>     > >> > > > > - Has anyone else experienced the same problem or made similar
> >>>>>>>     > >> > > > > observations on TravisCI? (I suspect it has something to do with
> >>>>>>>     > >> > > > > the time zone)
> >>>>>>>     > >> > > > >
> >>>>>>>     > >> > > > > - What pricing plan of TravisCI is Flink currently using? Is it the
> >>>>>>>     > >> > > > > free plan for open source projects? What is the guaranteed build
> >>>>>>>     > >> > > > > capacity of the current plan?
> >>>>>>>     > >> > > > >
> >>>>>>>     > >> > > > > - If the current pricing plan (either free or paid) can't provide
> >>>>>>>     > >> > > > > stable build capacity, can we upgrade to a higher-priced plan with
> >>>>>>>     > >> > > > > larger and more stable build capacity?
> >>>>>>>     > >> > > > >
> >>>>>>>     > >> > > > > BTW, another factor that contributes to the productivity problem is
> >>>>>>>     > >> > > > > that our build is slow - we run a full build for every PR, and a
> >>>>>>>     > >> > > > > successful full build takes ~5h. We definitely have more options to
> >>>>>>>     > >> > > > > solve it, for instance, modularizing the build graph and reusing
> >>>>>>>     > >> > > > > artifacts from the previous build.
> >>>>>>>     > >> > > > > But I think that can be a big effort which is much harder to
> >>>>>>>     > >> > > > > accomplish in a short period of time and may deserve its own
> >>>>>>>     > >> > > > > separate discussion.
> >>>>>>>     > >> > > > >
> >>>>>>>     > >> > > > > [1] https://travis-ci.org/apache/flink/pull_requests
> >>>>>>>
> >>>>>>>     --
> >>>>>>>     Best Regards
> >>>>>>>
> >>>>>>>     Jeff Zhang

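For readers following along, the polling pattern discussed above (a Jenkins job that keeps checking Travis until the build finishes) can be sketched roughly as follows. This is not the actual travis_check.py: the function name wait_for_build, the state names and the exit codes 0 and 1 are assumptions made for illustration. Only exit code 2 for "build information not found" is taken from the Jenkins script quoted above. The status lookup is passed in as a callable so the sketch runs without network access.

```python
import time
from typing import Callable, Optional

# Exit code 2 mirrors the "can't find build information" case checked by the
# quoted Jenkins script; 0 and 1 are assumptions for this sketch.
EXIT_PASSED = 0
EXIT_FAILED = 1
EXIT_NOT_FOUND = 2

# Hypothetical state names; a real CI API may use different ones.
FINISHED_OK = {"passed"}
FINISHED_BAD = {"failed", "errored", "canceled"}


def wait_for_build(
    fetch_state: Callable[[], Optional[str]],
    poll_interval: float = 60.0,
    max_polls: int = 300,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll fetch_state() until the build finishes or max_polls is reached.

    fetch_state returns the current build state, or None when no build for
    the commit can be found.
    """
    for _ in range(max_polls):
        state = fetch_state()
        if state is None:
            return EXIT_NOT_FOUND
        if state in FINISHED_OK:
            return EXIT_PASSED
        if state in FINISHED_BAD:
            return EXIT_FAILED
        # Still queued or running: the calling worker stays occupied while
        # it waits, which is the idle time raised as a concern in the thread.
        sleep(poll_interval)
    return EXIT_FAILED  # treat running out of polls as a failure


if __name__ == "__main__":
    # Simulated build that is queued, then running, then passed.
    states = iter(["created", "started", "passed"])
    waited = []
    code = wait_for_build(lambda: next(states), sleep=waited.append)
    print("exit code:", code, "sleeps:", len(waited))
```

Injecting fetch_state and sleep keeps the loop testable; in a real job the callable would wrap whatever status API is used, and the worker would be blocked for roughly poll_interval times the number of polls.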
People really have to stop thinking that just because something works
for us it is also a good solution.

Also, please remember that our builds run for 2h from start to finish, and
not the 14 _minutes_ it takes for Zeppelin. We are dealing with an entirely
different scale here, both in terms of build times and number of builds.

In this very thread people have been complaining about long queue times for
their builds. Surprise, other Apache projects have been suffering the very
same thing due to us not controlling our build times. While switching
services (be it Jenkins, CircleCI or whatever) will possibly work for us
(and these options are actually attractive, like CircleCI's proper support
for build artifacts), it will also likely result in us negatively affecting
other projects in significant ways.

Sure, the Jenkins setup has a good user experience for us, at the cost of
blocking Jenkins workers for a _lot_ of time. Right now we have 25 PRs in
our queue; that's possibly 50h we'd consume of Jenkins resources, and the
European contributors haven't even really started yet.

FYI, the latest INFRA response from INFRA-18533:

"Our rough metrics shows that Flink used over 5800 hours of build time last
month. That is equal to EIGHT servers running 24/7 for the ENTIRE MONTH.
EIGHT. nonstop.
When we discovered this last night, we discussed it some and are going to
tune down Flink to allow only five executors maximum. We cannot allow Flink
to consume so much of a Foundation shared resource."

So yes, we either
a) have to heavily reduce our CI usage or
b) fund our own, either maintaining it ourselves or donating to Apache.

On 02/07/2019 05:11, Bowen Li wrote:
> By looking at the git history of the Jenkins script, its core part was
> finished in March 2017 (with only two minor updates in 2017/2018), so
> it's been running for over two years now and feels like the Zeppelin
> community has been quite happy with it. @Jeff Zhang <[hidden email]>
> can you share your insights and user experience with the Jenkins+Travis
> approach?
> > Things like: > > - has the approach completely solved the resource capacity problem for > Zepplin community? is Zepplin community happy with the result? > - is the whole configuration chain stable (e.g. uptime) enough? > - how often do you need to maintain the Jenkins infra? how many people > are usually involved in maintenance and bug-fixes? > > The downside of this approach seems mostly to be on the maintenance to > me - maintain the script and Jenkins infra. > > ** Having Our Own Travis-CI.com Account ** > > Another alternative I've been thinking of is to have our own > travis-ci.com <http://travis-ci.com> account with paid dedicated > resources. Note travis-ci.org <http://travis-ci.org> is the free > version and travis-ci.com <http://travis-ci.com> is the commercial > version. We currently use a shared resource pool managed by ASK INFRA > team on travis-ci.org <http://travis-ci.org>, but we have no control > over it - we can't see how it's configured, how much resources are > available, how resources are allocated among Apache projects, etc. The > nice thing about having an account on travis-ci.com > <http://travis-ci.com> are: > > - relatively low cost with much better resource guarantee than what we > currently have [1]: $249/month with 5 dedicated concurrency, > $489/month with 10 concurrency > - low maintenance work compared to using Jenkins > - (potentially) no migration cost according to Travis's doc [2] > (pending verification) > - full control over the build capacity/configuration compared to using > ASF INFRA's pool > > I'd be surprised if we as such a vibrant community cannot find and > fund $249*12=$2988 a year in exchange for a much better developer > experience and much higher productivity. 
> > [1] https://travis-ci.com/plans > [2] > https://docs.travis-ci.com/user/migrate/open-source-repository-migration > > On Sat, Jun 29, 2019 at 8:39 AM Chesnay Schepler <[hidden email] > <mailto:[hidden email]>> wrote: > > So yes, the Jenkins job keeps pulling the state from Travis until it > finishes. > > Note sure I'm comfortable with the idea of using Jenkins workers > just to > idle for a several hours. > > On 29/06/2019 14:56, Jeff Zhang wrote: > > Here's what zeppelin community did, we make a python script to > check the > > build status of pull request. > > Here's script: > > https://github.com/apache/zeppelin/blob/master/travis_check.py > > > > And this is the script we used in Jenkins build job. > > > > if [ -f "travis_check.py" ]; then > > git log -n 1 > > STATUS=$(curl -s $BUILD_URL | grep -e "GitHub pull > request.*from.*" | sed > > 's/.*GitHub pull request <a > > href=\"\(https[^"]*\).*from[^"]*.\(https[^"]*\).*/\1 \2/g') > > AUTHOR=$(echo $STATUS | sed 's/.*[/]\(.*\)$/\1/g') > > PR=$(echo $STATUS | awk '{print $1}' | sed 's/.*[/]\(.*\)$/\1/g') > > #COMMIT=$(git log -n 1 | grep "^Merge:" | awk '{print $3}') > > #if [ -z $COMMIT ]; then > > # COMMIT=$(curl -s > https://api.github.com/repos/apache/zeppelin/pulls/$PR > > | grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | tr '\n' ' ' > | sed > > 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' | grep -v > "apache:" | > > sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g') > > #fi > > > > # get commit hash from PR > > COMMIT=$(curl -s > https://api.github.com/repos/apache/zeppelin/pulls/$PR | > > grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | tr '\n' ' ' | sed > > 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' | grep -v > "apache:" | > > sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g') > > sleep 30 # sleep few moment to wait travis starts the build > > RET_CODE=0 > > python ./travis_check.py ${AUTHOR} ${COMMIT} || RET_CODE=$? 
> > if [ $RET_CODE -eq 2 ]; then # try with repository name when > travis-ci is > > not available in the account > > RET_CODE=0 > > AUTHOR=$(curl -s > https://api.github.com/repos/apache/zeppelin/pulls/$PR > > | grep '"full_name":' | grep -v "apache/zeppelin" | sed > > 's/.*[:][^"]*["]\([^/]*\).*/\1/g') > > python ./travis_check.py ${AUTHOR} ${COMMIT} || RET_CODE=$? > > fi > > > > if [ $RET_CODE -eq 2 ]; then # fail with can't find build > information in > > the travis > > set +x > > echo "-----------------------------------------------------" > > echo "Looks like travis-ci is not configured for your fork." > > echo "Please setup by swich on 'zeppelin' repository at > > https://travis-ci.org/profile and travis-ci." > > echo "And then make sure 'Build branch updates' option is > enabled in > > the settings https://travis-ci.org/${AUTHOR}/zeppelin/settings > <https://travis-ci.org/$%7BAUTHOR%7D/zeppelin/settings>." > > echo "" > > echo "To trigger CI after setup, you will need ammend your > last commit > > with" > > echo "git commit --amend" > > echo "git push your-remote HEAD --force" > > echo "" > > echo "See > > > http://zeppelin.apache.org/contribution/contributions.html#continuous-integration > > ." > > fi > > > > exit $RET_CODE > > else > > set +x > > echo "travis_check.py does not exists" > > exit 1 > > fi > > > > Chesnay Schepler <[hidden email] > <mailto:[hidden email]>> 于2019年6月29日周六 下午3:17写道: > > > >> Does this imply that a Jenkins job is active as long as the > Travis build > >> runs? > >> > >> On 26/06/2019 21:28, Bowen Li wrote: > >>> Hi, > >>> > >>> @Dawid, I think the "long test running" as I mentioned in the > first > >> email, > >>> also as you guys said, belongs to "a big effort which is much > harder to > >>> accomplish in a short period of time and may deserve its own > separate > >>> discussion". Thus I didn't include it in what we can do in a > foreseeable > >>> short term. 
> >>> > >>> Besides, I don't think that's the ultimate reason for lack of > build > >>> resources. Even if the build is shortened to something like > 2h, the > >>> problems of no build machine works about 6 or more hours in > PST daytime > >>> that I described will still happen, because no machine from > ASF INFRA's > >>> pool is allocated to Flink. As I have paid close attention to > the build > >>> queue in the past few weekdays, it's a pretty clear pattern now. > >>> > >>> **The ultimate root cause** for that is - we don't have any > **dedicated** > >>> build resources that we can stably rely on. I'm actually ok to > wait for a > >>> long time if there are build requests running, it means at > least we are > >>> making progress. But I'm not ok with no build resource. A > better place I > >>> think we should aim at in short term is to always have at > least a central > >>> pool (can be 3 or 5) of machines dedicated to build Flink at > any time, or > >>> maybe use users resources. > >>> > >>> @Chesnay @Robert I synced with Jeff offline that Zeppelin > community is > >>> using a Jenkins job to automatically build on users' travis > account and > >>> link the result back to github PR. I guess the Jenkins job > would fetch > >>> latest upstream master and build the PR against it. Jeff has filed > >> tickets > >>> to learn and get access to the Jenkins infra. It'll better to > fully > >>> understand it first before judging this approach. > >>> > >>> I also heard good things about CircleCI, and ASF INFRA seems > to have a > >> pool > >>> of build capacity there too. Can be an alternative to consider. > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> On Wed, Jun 26, 2019 at 12:44 AM Dawid Wysakowicz < > >> [hidden email] <mailto:[hidden email]>> > >>> wrote: > >>> > >>>> Sorry to jump in late, but I think Bowen missed the most > important point > >>>> from Chesnay's previous message in the summary. 
The ultimate > reason for > >>>> all the problems is that the tests take close to 2 hours to > run already. > >>>> I fully support this claim: "Unless people start caring about > test times > >>>> before adding them, this issue cannot be solved" > >>>> > >>>> This is also another reason why using user's Travis account > won't help. > >>>> Every few weeks we reach the user's time limit for a single > profile. > >>>> This makes the user's builds simply fail, until we either > properly > >>>> decrease the time the tests take (which I am not sure we ever > did) or > >>>> postpone the problem by splitting into more profiles. (Note > that the ASF > >>>> Travis account has higher time limits) > >>>> > >>>> Best, > >>>> > >>>> Dawid > >>>> > >>>> On 26/06/2019 09:36, Robert Metzger wrote: > >>>>> Do we know if using "the best" available hardware would > improve the > >> build > >>>>> times? > >>>>> Imagine we would run the build on machines with plenty of > main memory > >> to > >>>>> mount everything to ramdisk + the latest CPU architecture? > >>>>> > >>>>> Throwing hardware at the problem could help reduce the time > of an > >>>>> individual build, and using our own infrastructure would > remove our > >>>>> dependency on Apache's Travis account (with the obvious > downside of > >>>> having > >>>>> to maintain the infrastructure) > >>>>> We could use an open source travis alternative, to have a > similar > >>>>> experience and make the migration easy. > >>>>> > >>>>> > >>>>> On Wed, Jun 26, 2019 at 9:34 AM Chesnay Schepler > <[hidden email] <mailto:[hidden email]>> > >>>> wrote: > >>>>>> From what I gathered, there's no special sauce that the > Zeppelin > >>>>>> project uses which actually integrates a users Travis > account into the > >>>> PR. > >>>>>> They just disabled Travis for PRs. And that's kind of it. 
> >>>>>> > >>>>>> Naturally we can do this (duh) and safe the ASF a fair > amount of > >>>>>> resources, but there are downsides: > >>>>>> > >>>>>> The discoverability of the Travis check takes a nose-dive. > Either we > >>>>>> require every contributor to always, an every commit, also > post a > >> Travis > >>>>>> build, or we have the reviewer sift through the > contributors account > >> to > >>>>>> find it. > >>>>>> > >>>>>> This is rather cumbersome. Additionally, it's also not > equivalent to > >>>>>> having a PR build. > >>>>>> > >>>>>> A normal branch build takes a branch as is and tests it. A > PR build > >>>>>> merges the branch into master, and then runs it. (Fun fact: > This is > >> why > >>>>>> a PR without merge conflicts is not being run on Travis.) > >>>>>> > >>>>>> And ultimately, everyone can already make use of this > approach anyway. > >>>>>> > >>>>>> On 25/06/2019 08:02, Jark Wu wrote: > >>>>>>> Hi Jeff, > >>>>>>> > >>>>>>> Thanks for sharing the Zeppelin approach. I think it's a > good idea to > >>>>>>> leverage user's travis account. > >>>>>>> In this way, we can have almost unlimited concurrent build > jobs and > >>>>>>> developers can restart build by themselves (currently only > committers > >>>>>>> can restart PR's build). > >>>>>>> > >>>>>>> But I'm still not very clear how to integrate user's > travis build > >> into > >>>>>>> the Flink pull request's build automatically. Can you > explain more in > >>>>>>> detail? > >>>>>>> > >>>>>>> Another question: does travis only build branches for user > account? > >>>>>>> My concern is that builds for PRs will rebase user's > commits against > >>>>>>> current master branch. > >>>>>>> This will help us to find problems before merge. Builds > for branches > >>>>>>> will lose the impact of new commits in master. > >>>>>>> How does Zeppelin solve this problem? > >>>>>>> > >>>>>>> Thanks again for sharing the idea. 
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Jark
> >>>>>>>
> >>>>>>> On Tue, 25 Jun 2019 at 11:01, Jeff Zhang <[hidden email]> wrote:
> >>>>>>>
> >>>>>>> Hi Folks,
> >>>>>>>
> >>>>>>> Zeppelin meet this kind of issue before, we solve it by delegating
> >>>>>>> each one's PR build to his travis account (Everyone can have 5 free
> >>>>>>> slot for travis build).
> >>>>>>> Apache account travis build is only triggered when PR is merged.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Kurt Young <[hidden email]> wrote on Tue, Jun 25, 2019 at 10:16 AM:
> >>>>>>>
> >>>>>>> > (Forgot to cc George)
> >>>>>>> >
> >>>>>>> > Best,
> >>>>>>> > Kurt
> >>>>>>> >
> >>>>>>> >
> >>>>>>> > On Tue, Jun 25, 2019 at 10:16 AM Kurt Young <[hidden email]> wrote:
> >>>>>>> >
> >>>>>>> > > Hi Bowen,
> >>>>>>> > >
> >>>>>>> > > Thanks for bringing this up. We actually have discussed about
> >>>>>>> > > this, and I think Till and George have
> >>>>>>> > > already spend sometime investigating it. I have cced both of
> >>>>>>> > > them, and maybe they can share
> >>>>>>> > > their findings.
> >>>>>>> > >
> >>>>>>> > > Best,
> >>>>>>> > > Kurt
> >>>>>>> > >
> >>>>>>> > >
> >>>>>>> > > On Tue, Jun 25, 2019 at 10:08 AM Jark Wu <[hidden email]> wrote:
> >>>>>>> > >
> >>>>>>> > >> Hi Bowen,
> >>>>>>> > >>
> >>>>>>> > >> Thanks for bringing this. We also suffered from the long
> >>>>>>> > >> build time.
> >>>>>>> > >> I agree that we should focus on solving build capacity
> >>>>>>> > >> problem in the thread.
> >>>>>>> > >>
> >>>>>>> > >> My observation is there is only one build is running, all the
> >>>>>>> > >> others (other PRs, master) are pending.
> >>>>>>> > >> The pricing plan[1] of travis shows it can support concurrent
> >>>>>>> > >> build jobs.
> >>>>>>> > >> But I don't know which plan we are using, might be the free
> >>>>>>> > >> plan for open source.
> >>>>>>> > >>
> >>>>>>> > >> I cc-ed Chesnay who may have some experience on Travis.
> >>>>>>> > >>
> >>>>>>> > >> Regards,
> >>>>>>> > >> Jark
> >>>>>>> > >>
> >>>>>>> > >> [1]: https://travis-ci.com/plans
> >>>>>>> > >>
> >>>>>>> > >> On Tue, 25 Jun 2019 at 08:11, Bowen Li <[hidden email]> wrote:
> >>>>>>> > >>
> >>>>>>> > >> > Hi Steven,
> >>>>>>> > >> >
> >>>>>>> > >> > I think you may not read what I wrote. The discussion is about
> >>>>>>> > >> > "unstable build **capacity**", in another word "unstable / lack
> >>>>>>> > >> > of build resources", not "unstable build".
> >>>>>>> > >> >
> >>>>>>> > >> > On Mon, Jun 24, 2019 at 4:40 PM Steven Wu <[hidden email]> wrote:
> >>>>>>> > >> >
> >>>>>>> > >> > > long and sometimes unstable build is definitely a pain point.
> >>>>>>> > >> > >
> >>>>>>> > >> > > I suspect the build failure here in flink-connector-kafka is
> >>>>>>> > >> > > not related to my change. but there is no easy re-run the
> >>>>>>> > >> > > build on travis UI. search showed a trick of close-and-open
> >>>>>>> > >> > > the PR will trigger rebuild. but that could add noises to the
> >>>>>>> > >> > > PR activities.
> >>>>>>> > >> > > https://travis-ci.org/apache/flink/jobs/545555519
> >>>>>>> > >> > >
> >>>>>>> > >> > > travis-ci for my personal repo often failed with exceeding time
> >>>>>>> > >> > > limit after 4+ hours.
> >>>>>>> > >> > > The job exceeded the maximum time limit for jobs, and has been
> >>>>>>> > >> > > terminated.
> >>>>>>> > >> > >
> >>>>>>> > >> > > On Mon, Jun 24, 2019 at 4:15 PM Bowen Li <[hidden email]> wrote:
> >>>>>>> > >> > >
> >>>>>>> > >> > > > https://travis-ci.org/apache/flink/builds/549681530
> >>>>>>> > >> > > > This build request has been sitting at **HEAD of the queue**
> >>>>>>> > >> > > > since I first saw it at PST 10:30am (not sure how long it's
> >>>>>>> > >> > > > been there before 10:30am). It's PST 4:12pm now and it hasn't
> >>>>>>> > >> > > > started yet.
> >>>>>>> > >> > > >
> >>>>>>> > >> > > > On Mon, Jun 24, 2019 at 2:48 PM Bowen Li <[hidden email]> wrote:
> >>>>>>> > >> > > >
> >>>>>>> > >> > > > > Hi devs,
> >>>>>>> > >> > > > >
> >>>>>>> > >> > > > > I've been experiencing the pain resulting from lack of stable
> >>>>>>> > >> > > > > build capacity on Travis for Flink PRs [1]. Specifically, I
> >>>>>>> > >> > > > > noticed often that no build in the queue is making any
> >>>>>>> > >> > > > > progress for hours, and suddenly 5 or 6 builds kick off all
> >>>>>>> > >> > > > > together after the long pause.
> >>>>>>> > >> > > > > I'm at PST (UTC-08) time zone, and I've seen pause can be as
> >>>>>>> > >> > > > > long as 6 hours from PST 9am to 3pm (let alone the time
> >>>>>>> > >> > > > > needed to drain the queue afterwards).
> >>>>>>> > >> > > > >
> >>>>>>> > >> > > > > I think this has greatly impacted our productivity. I've
> >>>>>>> > >> > > > > experienced that PRs submitted in the early morning of PST
> >>>>>>> > >> > > > > time zone won't finish their build until late night of the
> >>>>>>> > >> > > > > same day.
> >>>>>>> > >> > > > >
> >>>>>>> > >> > > > > So my questions are:
> >>>>>>> > >> > > > >
> >>>>>>> > >> > > > > - Has anyone else experienced the same problem or have similar
> >>>>>>> > >> > > > > observation on TravisCI? (I suspect it has things to do with
> >>>>>>> > >> > > > > time zone)
> >>>>>>> > >> > > > >
> >>>>>>> > >> > > > > - What pricing plan of TravisCI is Flink currently using? Is
> >>>>>>> > >> > > > > it the free plan for open source projects? What are the
> >>>>>>> > >> > > > > guaranteed build capacity of the current plan?
> >>>>>>> > >> > > > >
> >>>>>>> > >> > > > > - If the current pricing plan (either free or paid) can't
> >>>>>>> > >> > > > > provide stable build capacity, can we upgrade to a higher
> >>>>>>> > >> > > > > priced plan with larger and more stable build capacity?
> >>>>>>> > >> > > > >
> >>>>>>> > >> > > > > BTW, another factor that contribute to the productivity
> >>>>>>> > >> > > > > problem is that our build is slow - we run full build for
> >>>>>>> > >> > > > > every PR and a successful full build takes ~5h. We definitely
> >>>>>>> > >> > > > > have more options to solve it, for instance, modularize the
> >>>>>>> > >> > > > > build graphs and reuse artifacts from the previous build.
> >>>>>>> > >> > > > > But I think that can be a big effort which is much harder to
> >>>>>>> > >> > > > > accomplish in a short period of time and may deserve its own
> >>>>>>> > >> > > > > separate discussion.
> >>>>>>> > >> > > > >
> >>>>>>> > >> > > > > [1] https://travis-ci.org/apache/flink/pull_requests
> >>>>>>> > >> > > > >
> >>>>>>> > >> > > > >
> >>>>>>> > >> > > >
> >>>>>>> > >> > >
> >>>>>>> > >> >
> >>>>>>> > >>
> >>>>>>> > >
> >>>>>>> >
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> Best Regards
> >>>>>>>
> >>>>>>> Jeff Zhang
> >>>>>>>
> >>
>
|
As a short-term stopgap, since we can assume this issue to become much
worse in the following days/weeks, we could disable IT cases in PRs and
only run them on master.

On 02/07/2019 12:03, Chesnay Schepler wrote:
> People really have to stop thinking that just because something works
> for us it is also a good solution.
> Also, please remember that our builds run for 2h from start to finish,
> and not the 14 _minutes_ it takes for zeppelin.
> We are dealing with an entirely different scale here, both in terms of
> build times and number of builds.
>
> In this very thread people have been complaining about long queue
> times for their builds. Surprise, other Apache projects have been
> suffering the very same thing due to us not controlling our build
> times. While switching services (be it Jenkins, CircleCI or whatever)
> will possibly work for us (and these options are actually attractive,
> like CircleCI's proper support for build artifacts), it will also
> result in us likely negatively affecting other projects in significant
> ways.
>
> Sure, the Jenkins setup has a good user experience for us, at the cost
> of blocking Jenkins workers for a _lot_ of time. Right now we have 25
> PR's in our queue; that's possibly 50h we'd consume of Jenkins
> resources, and the European contributors haven't even really started yet.
>
> FYI, the latest INFRA response from INFRA-18533:
>
> "Our rough metrics shows that Flink used over 5800 hours of build time
> last month. That is equal to EIGHT servers running 24/7 for the ENTIRE
> MONTH. EIGHT. nonstop.
> When we discovered this last night, we discussed it some and are going
> to tune down Flink to allow only five executors maximum. We cannot
> allow Flink to consume so much of a Foundation shared resource."
>
> So yes, we either
> a) have to heavily reduce our CI usage or
> b) fund our own, either maintaining it ourselves or donating to Apache.
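
The figures quoted above can be sanity-checked with a little arithmetic. What follows is only a rough sketch and not something posted in the thread: the 30-day month, the even spread of builds across executors, and the function names are all assumptions made for illustration. The inputs (5800 build hours, 25 queued PRs, 2h per build, five executors) are the numbers from the messages above.

```python
HOURS_PER_DAY = 24


def equivalent_servers(build_hours, days_in_month=30):
    """Servers that would have to run nonstop to supply build_hours in one month.

    days_in_month=30 is an assumption; INFRA did not say how it counted.
    """
    return build_hours / (HOURS_PER_DAY * days_in_month)


def queue_build_hours(queued_builds, hours_per_build):
    """Total executor time needed to work through the queue."""
    return queued_builds * hours_per_build


def queue_drain_hours(queued_builds, hours_per_build, executors):
    """Wall-clock time to drain the queue, assuming builds spread evenly
    over the executors and nothing else competes for them."""
    return queue_build_hours(queued_builds, hours_per_build) / executors


print(f"5800 build hours ~ {equivalent_servers(5800):.2f} servers running 24/7")
print(f"25 queued PRs at 2h each = {queue_build_hours(25, 2.0):.0f} executor hours")
print(f"on 5 executors that is roughly {queue_drain_hours(25, 2.0, 5):.0f}h wall-clock")
```

Under these assumptions the INFRA figure of "eight servers" and the "possibly 50h" estimate both check out, and a cap of five executors would need on the order of ten hours to clear a 25-PR queue even if all five were always available to Flink.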
I responded in the INFRA ticket [1] that I believe they are applying the
wrong metric to Flink: total build time is a completely different thing
from guaranteed build capacity.

My response:

"As mentioned above, I started paying attention to Flink's build queue a
few weeks ago. I'm in Seattle, and I saw no Flink build kicking off during
PST daytime on weekdays. Our teammates in China and Europe have reported
similar observations. So we need to evaluate where the large total build
time came from. If 1) your number and 2) our observations from three
locations that cover pretty much a full day are all true, I **guess** one
reason is that the extra build time most likely came from weekends, when
other Apache projects may be idle and Flink simply drains its congested
queue.

Please be aware that we're not complaining about the lack of resources in
general; I'm complaining about the lack of **stable, dedicated** resources.
An example of the latter: currently, even if no build is in Flink's queue
and I submit a request that becomes the queue head in the PST morning, my
build won't even start for 6-8+ hours. That is an absurd amount of waiting
time.

That is to say, if ASF INFRA decides to adopt a quota system and grants
Flink five DEDICATED servers that run all the time only for Flink, that'll
be PERFECT and can totally solve our problem now.

I feel what's missing in the ASF INFRA's Travis resource pool is some level of build capacity SLAs and certainty" Again, I believe there are differences in nature of these two problems, long build time v.s. lack of dedicated build resource. That's saying, shortening build time may relieve the situation, and may not. I'm sightly negative on disabling IT cases for PRs, due to the downside is that we are at risk of any potential bugs in PR that UTs doesn't catch, and may cost a lot more to fix and if it slows others down or even block others, but am open to others opinions on it. AFAICT from INFRA ticket[1], donating to ASF INFRA won't be feasible to solve our problem since INFRA's pool is fully shared and they have no control and finer insights over resource allocation to a specific Apache project. As mentioned in [1], Apache Arrow is moving away from ASF INFRA Travis pool (they are actually surprised Flink hasn't plan to do so). I know that Spark is on its own build infra. If we all agree that funding our own build infra, I'd be glad to help investigate any potential options after releasing 1.9 since I'm super busy with 1.9 now. [1] https://issues.apache.org/jira/browse/INFRA-18533 On Tue, Jul 2, 2019 at 4:46 AM Chesnay Schepler <[hidden email]> wrote: > As a short-term stopgap, since we can assume this issue to become much > worse in the following days/weeks, we could disable IT cases in PRs and > only run them on master. > > On 02/07/2019 12:03, Chesnay Schepler wrote: > > People really have to stop thinking that just because something works > > for us it is also a good solution. > > Also, please remember that our builds run for 2h from start to finish, > > and not the 14 _minutes_ it takes for zeppelin. > > We are dealing with an entirely different scale here, both in terms of > > build times and number of builds. > > > > In this very thread people have been complaining about long queue > > times for their builds. 
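As a rough illustration of the distinction drawn in the message above, the sketch below separates the two metrics. It takes the figure INFRA reported in INFRA-18533 (over 5800 build hours in a month) and converts it into an average number of busy executors, then models the wait a single build sees when no executor is available to the project for a stretch of the day. The function names, the 30-day month, and the 09:00 to 15:00 unavailability window are assumptions made for illustration only; they are not taken from INFRA's tooling or from Travis.

```python
HOURS_PER_DAY = 24


def average_executors(total_build_hours, days_in_month=30):
    """Average number of executors kept busy around the clock.

    This is a monthly aggregate: it says nothing about when builds ran.
    """
    return total_build_hours / (days_in_month * HOURS_PER_DAY)


def queue_wait_hours(submit_hour, unavailable_windows):
    """Hours a build waits before it can start.

    unavailable_windows is a list of (start_hour, end_hour) pairs during
    which no executor is available to the project (hypothetical model).
    """
    for start, end in unavailable_windows:
        if start <= submit_hour < end:
            return end - submit_hour
    return 0


# Figure reported by INFRA: 5800 build hours in one month.
avg = average_executors(5800)
print(f"average busy executors over the month: {avg:.2f}")

# Assumed window: no executor available to the project from 09:00 to 15:00.
wait = queue_wait_hours(9, [(9, 15)])
print(f"wait for a build submitted at 09:00: {wait}h")
```

Under these assumptions the monthly average can be high while a build submitted at the start of the window still waits for the whole window, which is why a large total build time and a lack of guaranteed capacity are not contradictory.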
Are they using their own Travis CI pool, or did they switch to an entirely different CI service?

If we can just switch to our own Travis pool, just for our project, then this might be something we can do fairly quickly?

On 03/07/2019 05:55, Bowen Li wrote:
> I responded in the INFRA ticket [1] that I believe they are using a wrong
> metric against Flink and the total build time is a completely different
> thing than guaranteed build capacity.
>
> My response:
>
> "As mentioned above, since I started to pay attention to Flink's build
> queue a few tens of days ago, I'm in Seattle and I saw no build was kicking
> off in PST daytime in weekdays for Flink. Our teammates in China and Europe
> have also reported similar observations. So we need to evaluate how the
> large total build time came from - if 1) your number and 2) our
> observations from three locations that cover pretty much a full day, are
> all true, I **guess** one reason can be that - highly likely the extra
> build time came from weekends when other Apache projects may be idle and
> Flink just drains hard its congested queue.
>
> Please be aware of that we're not complaining about the lack of resources
> in general, I'm complaining about the lack of **stable, dedicated**
> resources. An example for the latter one is, currently even if no build is
> in Flink's queue and I submit a request to be the queue head in PST
> morning, my build won't even start in 6-8+h. That is an absurd amount of
> waiting time.
>
> That's saying, if ASF INFRA decides to adopt a quota system and grants
> Flink five DEDICATED servers that runs all the time only for Flink, that'll
> be PERFECT and can totally solve our problem now.
> RET_CODE=0 >>>> > python ./travis_check.py ${AUTHOR} ${COMMIT} || RET_CODE=$? >>>> > if [ $RET_CODE -eq 2 ]; then # try with repository name when >>>> travis-ci is >>>> > not available in the account >>>> > RET_CODE=0 >>>> > AUTHOR=$(curl -s >>>> https://api.github.com/repos/apache/zeppelin/pulls/$PR >>>> > | grep '"full_name":' | grep -v "apache/zeppelin" | sed >>>> > 's/.*[:][^"]*["]\([^/]*\).*/\1/g') >>>> > python ./travis_check.py ${AUTHOR} ${COMMIT} || RET_CODE=$? >>>> > fi >>>> > >>>> > if [ $RET_CODE -eq 2 ]; then # fail with can't find build >>>> information in >>>> > the travis >>>> > set +x >>>> > echo "-----------------------------------------------------" >>>> > echo "Looks like travis-ci is not configured for your fork." >>>> > echo "Please setup by swich on 'zeppelin' repository at >>>> > https://travis-ci.org/profile and travis-ci." >>>> > echo "And then make sure 'Build branch updates' option is >>>> enabled in >>>> > the settings https://travis-ci.org/${AUTHOR}/zeppelin/settings >>>> <https://travis-ci.org/$%7BAUTHOR%7D/zeppelin/settings>." >>>> > echo "" >>>> > echo "To trigger CI after setup, you will need ammend your >>>> last commit >>>> > with" >>>> > echo "git commit --amend" >>>> > echo "git push your-remote HEAD --force" >>>> > echo "" >>>> > echo "See >>>> > >>>> >> http://zeppelin.apache.org/contribution/contributions.html#continuous-integration >>>> > ." >>>> > fi >>>> > >>>> > exit $RET_CODE >>>> > else >>>> > set +x >>>> > echo "travis_check.py does not exists" >>>> > exit 1 >>>> > fi >>>> > >>>> > Chesnay Schepler <[hidden email] >>>> <mailto:[hidden email]>> 于2019年6月29日周六 下午3:17写道: >>>> > >>>> >> Does this imply that a Jenkins job is active as long as the >>>> Travis build >>>> >> runs? 
>>>> >> >>>> >> On 26/06/2019 21:28, Bowen Li wrote: >>>> >>> Hi, >>>> >>> >>>> >>> @Dawid, I think the "long test running" as I mentioned in the >>>> first >>>> >> email, >>>> >>> also as you guys said, belongs to "a big effort which is much >>>> harder to >>>> >>> accomplish in a short period of time and may deserve its own >>>> separate >>>> >>> discussion". Thus I didn't include it in what we can do in a >>>> foreseeable >>>> >>> short term. >>>> >>> >>>> >>> Besides, I don't think that's the ultimate reason for lack of >>>> build >>>> >>> resources. Even if the build is shortened to something like >>>> 2h, the >>>> >>> problems of no build machine works about 6 or more hours in >>>> PST daytime >>>> >>> that I described will still happen, because no machine from >>>> ASF INFRA's >>>> >>> pool is allocated to Flink. As I have paid close attention to >>>> the build >>>> >>> queue in the past few weekdays, it's a pretty clear pattern now. >>>> >>> >>>> >>> **The ultimate root cause** for that is - we don't have any >>>> **dedicated** >>>> >>> build resources that we can stably rely on. I'm actually ok to >>>> wait for a >>>> >>> long time if there are build requests running, it means at >>>> least we are >>>> >>> making progress. But I'm not ok with no build resource. A >>>> better place I >>>> >>> think we should aim at in short term is to always have at >>>> least a central >>>> >>> pool (can be 3 or 5) of machines dedicated to build Flink at >>>> any time, or >>>> >>> maybe use users resources. >>>> >>> >>>> >>> @Chesnay @Robert I synced with Jeff offline that Zeppelin >>>> community is >>>> >>> using a Jenkins job to automatically build on users' travis >>>> account and >>>> >>> link the result back to github PR. I guess the Jenkins job >>>> would fetch >>>> >>> latest upstream master and build the PR against it. Jeff has >>>> filed >>>> >> tickets >>>> >>> to learn and get access to the Jenkins infra. 
It'll better to >>>> fully >>>> >>> understand it first before judging this approach. >>>> >>> >>>> >>> I also heard good things about CircleCI, and ASF INFRA seems >>>> to have a >>>> >> pool >>>> >>> of build capacity there too. Can be an alternative to consider. >>>> >>> >>>> >>> >>>> >>> >>>> >>> >>>> >>> >>>> >>> >>>> >>> >>>> >>> >>>> >>> >>>> >>> On Wed, Jun 26, 2019 at 12:44 AM Dawid Wysakowicz < >>>> >> [hidden email] <mailto:[hidden email]>> >>>> >>> wrote: >>>> >>> >>>> >>>> Sorry to jump in late, but I think Bowen missed the most >>>> important point >>>> >>>> from Chesnay's previous message in the summary. The ultimate >>>> reason for >>>> >>>> all the problems is that the tests take close to 2 hours to >>>> run already. >>>> >>>> I fully support this claim: "Unless people start caring about >>>> test times >>>> >>>> before adding them, this issue cannot be solved" >>>> >>>> >>>> >>>> This is also another reason why using user's Travis account >>>> won't help. >>>> >>>> Every few weeks we reach the user's time limit for a single >>>> profile. >>>> >>>> This makes the user's builds simply fail, until we either >>>> properly >>>> >>>> decrease the time the tests take (which I am not sure we ever >>>> did) or >>>> >>>> postpone the problem by splitting into more profiles. (Note >>>> that the ASF >>>> >>>> Travis account has higher time limits) >>>> >>>> >>>> >>>> Best, >>>> >>>> >>>> >>>> Dawid >>>> >>>> >>>> >>>> On 26/06/2019 09:36, Robert Metzger wrote: >>>> >>>>> Do we know if using "the best" available hardware would >>>> improve the >>>> >> build >>>> >>>>> times? >>>> >>>>> Imagine we would run the build on machines with plenty of >>>> main memory >>>> >> to >>>> >>>>> mount everything to ramdisk + the latest CPU architecture? 
>>>> >>>>> >>>> >>>>> Throwing hardware at the problem could help reduce the time >>>> of an >>>> >>>>> individual build, and using our own infrastructure would >>>> remove our >>>> >>>>> dependency on Apache's Travis account (with the obvious >>>> downside of >>>> >>>> having >>>> >>>>> to maintain the infrastructure) >>>> >>>>> We could use an open source travis alternative, to have a >>>> similar >>>> >>>>> experience and make the migration easy. >>>> >>>>> >>>> >>>>> >>>> >>>>> On Wed, Jun 26, 2019 at 9:34 AM Chesnay Schepler >>>> <[hidden email] <mailto:[hidden email]>> >>>> >>>> wrote: >>>> >>>>>> From what I gathered, there's no special sauce that the >>>> Zeppelin >>>> >>>>>> project uses which actually integrates a users Travis >>>> account into the >>>> >>>> PR. >>>> >>>>>> They just disabled Travis for PRs. And that's kind of it. >>>> >>>>>> >>>> >>>>>> Naturally we can do this (duh) and safe the ASF a fair >>>> amount of >>>> >>>>>> resources, but there are downsides: >>>> >>>>>> >>>> >>>>>> The discoverability of the Travis check takes a nose-dive. >>>> Either we >>>> >>>>>> require every contributor to always, an every commit, also >>>> post a >>>> >> Travis >>>> >>>>>> build, or we have the reviewer sift through the >>>> contributors account >>>> >> to >>>> >>>>>> find it. >>>> >>>>>> >>>> >>>>>> This is rather cumbersome. Additionally, it's also not >>>> equivalent to >>>> >>>>>> having a PR build. >>>> >>>>>> >>>> >>>>>> A normal branch build takes a branch as is and tests it. A >>>> PR build >>>> >>>>>> merges the branch into master, and then runs it. (Fun fact: >>>> This is >>>> >> why >>>> >>>>>> a PR without merge conflicts is not being run on Travis.) >>>> >>>>>> >>>> >>>>>> And ultimately, everyone can already make use of this >>>> approach anyway. >>>> >>>>>> >>>> >>>>>> On 25/06/2019 08:02, Jark Wu wrote: >>>> >>>>>>> Hi Jeff, >>>> >>>>>>> >>>> >>>>>>> Thanks for sharing the Zeppelin approach. 
I think it's a >>>> good idea to >>>> >>>>>>> leverage user's travis account. >>>> >>>>>>> In this way, we can have almost unlimited concurrent build >>>> jobs and >>>> >>>>>>> developers can restart build by themselves (currently only >>>> committers >>>> >>>>>>> can restart PR's build). >>>> >>>>>>> >>>> >>>>>>> But I'm still not very clear how to integrate user's >>>> travis build >>>> >> into >>>> >>>>>>> the Flink pull request's build automatically. Can you >>>> explain more in >>>> >>>>>>> detail? >>>> >>>>>>> >>>> >>>>>>> Another question: does travis only build branches for user >>>> account? >>>> >>>>>>> My concern is that builds for PRs will rebase user's >>>> commits against >>>> >>>>>>> current master branch. >>>> >>>>>>> This will help us to find problems before merge. Builds >>>> for branches >>>> >>>>>>> will lose the impact of new commits in master. >>>> >>>>>>> How does Zeppelin solve this problem? >>>> >>>>>>> >>>> >>>>>>> Thanks again for sharing the idea. >>>> >>>>>>> >>>> >>>>>>> Regards, >>>> >>>>>>> Jark >>>> >>>>>>> >>>> >>>>>>> On Tue, 25 Jun 2019 at 11:01, Jeff Zhang <[hidden email] >>>> <mailto:[hidden email]> >>>> >>>>>>> <mailto:[hidden email] <mailto:[hidden email]>>> wrote: >>>> >>>>>>> >>>> >>>>>>> Hi Folks, >>>> >>>>>>> >>>> >>>>>>> Zeppelin meet this kind of issue before, we solve >>>> it by >>>> >> delegating >>>> >>>>>>> each >>>> >>>>>>> one's PR build to his travis account (Everyone can >>>> have 5 free >>>> >>>>>>> slot for >>>> >>>>>>> travis build). >>>> >>>>>>> Apache account travis build is only triggered when >>>> PR is merged. 
>>>> >>>>>>> >>>> >>>>>>> >>>> >>>>>>> >>>> >>>>>>> Kurt Young <[hidden email] >>>> <mailto:[hidden email]> <mailto:[hidden email] >>>> <mailto:[hidden email]>>> >>>> >>>>>>> 于2019年6月25日周二 上午10:16写道: >>>> >>>>>>> >>>> >>>>>>> > (Forgot to cc George) >>>> >>>>>>> > >>>> >>>>>>> > Best, >>>> >>>>>>> > Kurt >>>> >>>>>>> > >>>> >>>>>>> > >>>> >>>>>>> > On Tue, Jun 25, 2019 at 10:16 AM Kurt Young >>>> <[hidden email] <mailto:[hidden email]> >>>> >>>>>>> <mailto:[hidden email] <mailto:[hidden email]>>> >>>> wrote: >>>> >>>>>>> > >>>> >>>>>>> > > Hi Bowen, >>>> >>>>>>> > > >>>> >>>>>>> > > Thanks for bringing this up. We actually have >>>> discussed >>>> >> about >>>> >>>>>>> this, and I >>>> >>>>>>> > > think Till and George have >>>> >>>>>>> > > already spend sometime investigating it. I have >>>> cced both of >>>> >>>>>>> them, and >>>> >>>>>>> > > maybe they can share >>>> >>>>>>> > > their findings. >>>> >>>>>>> > > >>>> >>>>>>> > > Best, >>>> >>>>>>> > > Kurt >>>> >>>>>>> > > >>>> >>>>>>> > > >>>> >>>>>>> > > On Tue, Jun 25, 2019 at 10:08 AM Jark Wu >>>> <[hidden email] <mailto:[hidden email]> >>>> >>>>>>> <mailto:[hidden email] <mailto:[hidden email]>>> >>>> wrote: >>>> >>>>>>> > > >>>> >>>>>>> > >> Hi Bowen, >>>> >>>>>>> > >> >>>> >>>>>>> > >> Thanks for bringing this. We also suffered from >>>> the long >>>> >>>>>>> build time. >>>> >>>>>>> > >> I agree that we should focus on solving build >>>> capacity >>>> >>>>>>> problem in the >>>> >>>>>>> > >> thread. >>>> >>>>>>> > >> >>>> >>>>>>> > >> My observation is there is only one build is >>>> running, all >>>> >> the >>>> >>>>>>> others >>>> >>>>>>> > >> (other >>>> >>>>>>> > >> PRs, master) are pending. >>>> >>>>>>> > >> The pricing plan[1] of travis shows it can >>>> support >>>> >> concurrent >>>> >>>>>>> build >>>> >>>>>>> > jobs. >>>> >>>>>>> > >> But I don't know which plan we are using, might >>>> be the free >>>> >>>>>>> plan for >>>> >>>>>>> > open >>>> >>>>>>> > >> source. 
>>>> >>>>>>> > >> >>>> >>>>>>> > >> I cc-ed Chesnay who may have some experience on >>>> Travis. >>>> >>>>>>> > >> >>>> >>>>>>> > >> Regards, >>>> >>>>>>> > >> Jark >>>> >>>>>>> > >> >>>> >>>>>>> > >> [1]: https://travis-ci.com/plans >>>> >>>>>>> > >> >>>> >>>>>>> > >> On Tue, 25 Jun 2019 at 08:11, Bowen Li < >>>> >> [hidden email] <mailto:[hidden email]> >>>> >>>>>>> <mailto:[hidden email] >>>> <mailto:[hidden email]>>> wrote: >>>> >>>>>>> > >> >>>> >>>>>>> > >> > Hi Steven, >>>> >>>>>>> > >> > >>>> >>>>>>> > >> > I think you may not read what I wrote. The >>>> discussion is >>>> >>>> about >>>> >>>>>>> > "unstable >>>> >>>>>>> > >> > build **capacity**", in another word >>>> "unstable / lack of >>>> >>>> build >>>> >>>>>>> > >> resources", >>>> >>>>>>> > >> > not "unstable build". >>>> >>>>>>> > >> > >>>> >>>>>>> > >> > On Mon, Jun 24, 2019 at 4:40 PM Steven Wu >>>> >>>>>>> <[hidden email] <mailto:[hidden email]> >>>> <mailto:[hidden email] <mailto:[hidden email]>>> >>>> >>>>>>> > wrote: >>>> >>>>>>> > >> > >>>> >>>>>>> > >> > > long and sometimes unstable build is >>>> definitely a pain >>>> >>>>>> point. >>>> >>>>>>> > >> > > >>>> >>>>>>> > >> > > I suspect the build failure here in >>>> >> flink-connector-kafka >>>> >>>>>>> is not >>>> >>>>>>> > >> related >>>> >>>>>>> > >> > to >>>> >>>>>>> > >> > > my change. but there is no easy re-run the >>>> build on >>>> >>>>>>> travis UI. >>>> >>>>>>> > >> > > search showed a trick of close-and-open the >>>> PR will >>>> >>>>>>> trigger rebuild. >>>> >>>>>>> > >> but >>>> >>>>>>> > >> > > that could add noises to the PR activities. >>>> >>>>>>> > >> > > >>>> https://travis-ci.org/apache/flink/jobs/545555519 >>>> >>>>>>> > >> > > >>>> >>>>>>> > >> > > travis-ci for my personal repo often failed >>>> with >>>> >>>>>>> exceeding time >>>> >>>>>>> > limit >>>> >>>>>>> > >> > after >>>> >>>>>>> > >> > > 4+ hours. 
>>>> >>>>>>> > >> > > The job exceeded the maximum time limit for >>>> jobs, and >>>> >> has >>>> >>>>>>> been >>>> >>>>>>> > >> > terminated. >>>> >>>>>>> > >> > > >>>> >>>>>>> > >> > > On Mon, Jun 24, 2019 at 4:15 PM Bowen Li >>>> >>>>>>> <[hidden email] <mailto:[hidden email]> >>>> <mailto:[hidden email] <mailto:[hidden email]>>> >>>> >>>>>>> > wrote: >>>> >>>>>>> > >> > > >>>> >>>>>>> > >> > > > >>>> https://travis-ci.org/apache/flink/builds/549681530 >>>> >>>>>>> This build >>>> >>>>>>> > >> > request >>>> >>>>>>> > >> > > > has >>>> >>>>>>> > >> > > > been sitting at **HEAD of the queue** >>>> since I first >>>> >> saw >>>> >>>>>>> it at PST >>>> >>>>>>> > >> > 10:30am >>>> >>>>>>> > >> > > > (not sure how long it's been there before >>>> 10:30am). >>>> >>>>>>> It's PST >>>> >>>>>>> > 4:12pm >>>> >>>>>>> > >> now >>>> >>>>>>> > >> > > and >>>> >>>>>>> > >> > > > it hasn't started yet. >>>> >>>>>>> > >> > > > >>>> >>>>>>> > >> > > > On Mon, Jun 24, 2019 at 2:48 PM Bowen Li >>>> >>>>>>> <[hidden email] <mailto:[hidden email]> >>>> <mailto:[hidden email] <mailto:[hidden email]>>> >>>> >>>>>>> > >> wrote: >>>> >>>>>>> > >> > > > >>>> >>>>>>> > >> > > > > Hi devs, >>>> >>>>>>> > >> > > > > >>>> >>>>>>> > >> > > > > I've been experiencing the pain >>>> resulting from lack >>>> >>>>>>> of stable >>>> >>>>>>> > >> build >>>> >>>>>>> > >> > > > > capacity on Travis for Flink PRs [1]. >>>> >> Specifically, I >>>> >>>>>>> noticed >>>> >>>>>>> > >> often >>>> >>>>>>> > >> > > that >>>> >>>>>>> > >> > > > no >>>> >>>>>>> > >> > > > > build in the queue is making any >>>> progress for >>>> >> hours, >>>> >>>> and >>>> >>>>>>> > suddenly >>>> >>>>>>> > >> 5 >>>> >>>>>>> > >> > or >>>> >>>>>>> > >> > > 6 >>>> >>>>>>> > >> > > > > builds kick off all together after the >>>> long pause. 
>>>> >>>>>>> I'm at PST >>>> >>>>>>> > >> > (UTC-08) >>>> >>>>>>> > >> > > > time >>>> >>>>>>> > >> > > > > zone, and I've seen pause can be as >>>> long as 6 hours >>>> >>>>>>> from PST 9am >>>> >>>>>>> > >> to >>>> >>>>>>> > >> > 3pm >>>> >>>>>>> > >> > > > > (let alone the time needed to drain the >>>> queue >>>> >>>>>>> afterwards). >>>> >>>>>>> > >> > > > > >>>> >>>>>>> > >> > > > > I think this has greatly impacted our >>>> productivity. >>>> >>>> I've >>>> >>>>>>> > >> experienced >>>> >>>>>>> > >> > > that >>>> >>>>>>> > >> > > > > PRs submitted in the early morning of >>>> PST time zone >>>> >>>>>>> won't finish >>>> >>>>>>> > >> > their >>>> >>>>>>> > >> > > > > build until late night of the same day. >>>> >>>>>>> > >> > > > > >>>> >>>>>>> > >> > > > > So my questions are: >>>> >>>>>>> > >> > > > > >>>> >>>>>>> > >> > > > > - Has anyone else experienced the same >>>> problem or >>>> >>>>>>> have similar >>>> >>>>>>> > >> > > > observation >>>> >>>>>>> > >> > > > > on TravisCI? (I suspect it has things >>>> to do with >>>> >> time >>>> >>>>>>> zone) >>>> >>>>>>> > >> > > > > >>>> >>>>>>> > >> > > > > - What pricing plan of TravisCI is >>>> Flink currently >>>> >>>>>>> using? Is it >>>> >>>>>>> > >> the >>>> >>>>>>> > >> > > free >>>> >>>>>>> > >> > > > > plan for open source projects? What >>>> are the >>>> >>>>>>> guaranteed build >>>> >>>>>>> > >> capacity >>>> >>>>>>> > >> > > of >>>> >>>>>>> > >> > > > > the current plan? >>>> >>>>>>> > >> > > > > >>>> >>>>>>> > >> > > > > - If the current pricing plan (either >>>> free or paid) >>>> >>>>>> can't >>>> >>>>>>> > provide >>>> >>>>>>> > >> > > stable >>>> >>>>>>> > >> > > > > build capacity, can we upgrade to a >>>> higher priced >>>> >>>>>>> plan with >>>> >>>>>>> > larger >>>> >>>>>>> > >> > and >>>> >>>>>>> > >> > > > more >>>> >>>>>>> > >> > > > > stable build capacity? 
>>>> >>>>>>> > >> > > > > >>>> >>>>>>> > >> > > > > BTW, another factor that contribute to >>>> the >>>> >>>>>>> productivity problem >>>> >>>>>>> > is >>>> >>>>>>> > >> > that >>>> >>>>>>> > >> > > > > our build is slow - we run full build >>>> for every PR >>>> >>>> and a >>>> >>>>>>> > >> successful >>>> >>>>>>> > >> > > full >>>> >>>>>>> > >> > > > > build takes ~5h. We definitely have >>>> more options to >>>> >>>>>>> solve it, >>>> >>>>>>> > for >>>> >>>>>>> > >> > > > instance, >>>> >>>>>>> > >> > > > > modularize the build graphs and reuse >>>> artifacts >>>> >> from >>>> >>>> the >>>> >>>>>>> > previous >>>> >>>>>>> > >> > > build. >>>> >>>>>>> > >> > > > > But I think that can be a big effort >>>> which is much >>>> >>>>>>> harder to >>>> >>>>>>> > >> > accomplish >>>> >>>>>>> > >> > > > in >>>> >>>>>>> > >> > > > > a short period of time and may deserve >>>> its own >>>> >>>> separate >>>> >>>>>>> > >> discussion. >>>> >>>>>>> > >> > > > > >>>> >>>>>>> > >> > > > > [1] >>>> >> https://travis-ci.org/apache/flink/pull_requests >>>> >>>>>>> > >> > > > > >>>> >>>>>>> > >> > > > > >>>> >>>>>>> > >> > > > >>>> >>>>>>> > >> > > >>>> >>>>>>> > >> > >>>> >>>>>>> > >> >>>> >>>>>>> > > >>>> >>>>>>> > >>>> >>>>>>> >>>> >>>>>>> >>>> >>>>>>> -- >>>> >>>>>>> Best Regards >>>> >>>>>>> >>>> >>>>>>> Jeff Zhang >>>> >>>>>>> >>>> >> >>>> >>> >> |
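As a quick sanity check on the capacity figures quoted in this thread, the arithmetic can be sketched as below. This is only a back-of-the-envelope calculation: the 30-day month and the helper names are assumptions made for illustration, while the 5800 build hours, the 25 queued PRs at about 2h each, and the $249/month plan are the numbers given in the messages.

```python
# Back-of-the-envelope check of the numbers quoted in the thread.
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30  # assumption: INFRA did not say which month length it used


def servers_equivalent(build_hours_per_month: float) -> float:
    """Servers running 24/7 needed to supply the given build hours in a month."""
    return build_hours_per_month / (HOURS_PER_DAY * DAYS_PER_MONTH)


def queue_hours(queued_builds: int, hours_per_build: int) -> int:
    """Total worker time needed to drain a queue, ignoring parallelism."""
    return queued_builds * hours_per_build


def yearly_cost(monthly_price_usd: int) -> int:
    """Yearly cost of a plan billed monthly."""
    return monthly_price_usd * 12


print(round(servers_equivalent(5800), 2))  # 8.06, i.e. INFRA's "EIGHT servers"
print(queue_hours(25, 2))                  # 50, the "possibly 50h" of Jenkins time
print(yearly_cost(249))                    # 2988, the $249*12 figure
```

Under the same 30-day assumption, a cap of five executors corresponds to at most 5 * 720 = 3600 build hours per month, which is well below the 5800 hours INFRA reported.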
Re: > Are they using their own Travis CI pool, or did they switch to an
entirely different CI service?

I reached out to Wes and Krisztián from the Apache Arrow PMC. They are
currently moving away from ASF's Travis to their own in-house metal
machines at [1], with a custom CI application at [2]. They've seen
significant improvements: much higher performance and basically no
resource waiting time. A "night-and-day" difference, quoting Wes.

Re: > If we can just switch to our own Travis pool, just for our project,
then this might be something we can do fairly quickly?

I believe so, according to [3] and [4].

[1] https://ci.ursalabs.org/
[2] https://github.com/ursa-labs/ursabot
[3] https://docs.travis-ci.com/user/migrate/open-source-repository-migration
[4] https://docs.travis-ci.com/user/migrate/open-source-on-travis-ci-com

On Wed, Jul 3, 2019 at 12:01 AM Chesnay Schepler <[hidden email]> wrote:

> Are they using their own Travis CI pool, or did they switch to an
> entirely different CI service?
>
> If we can just switch to our own Travis pool, just for our project, then
> this might be something we can do fairly quickly?
>
> On 03/07/2019 05:55, Bowen Li wrote:
> > I responded in the INFRA ticket [1] that I believe they are using a wrong
> > metric against Flink and the total build time is a completely different
> > thing than guaranteed build capacity.
> >
> > My response:
> >
> > "As mentioned above, I'm in Seattle, and since I started paying attention
> > to Flink's build queue a few tens of days ago, I have seen no build kick
> > off for Flink during PST daytime on weekdays. Our teammates in China and
> > Europe have also reported similar observations.
> > So we need to evaluate where the large total build time came from. If
> > 1) your number and 2) our observations from three locations that cover
> > pretty much a full day are all true, I **guess** one likely reason is
> > that the extra build time came from weekends, when other Apache projects
> > may be idle and Flink just drains its congested queue hard.
> >
> > Please be aware that we're not complaining about the lack of resources
> > in general; I'm complaining about the lack of **stable, dedicated**
> > resources. An example of the latter: currently, even if no build is in
> > Flink's queue and I submit a request that becomes the queue head in the
> > PST morning, my build won't even start for 6-8+ hours. That is an absurd
> > amount of waiting time.
> >
> > That is to say, if ASF INFRA decides to adopt a quota system and grants
> > Flink five DEDICATED servers that run all the time only for Flink,
> > that'll be PERFECT and can totally solve our problem now.
> >
> > I feel what's missing in the ASF INFRA's Travis resource pool is some
> > level of build capacity SLAs and certainty"
> >
> >
> > Again, I believe these two problems differ in nature: long build time
> > vs. lack of dedicated build resources. That is to say, shortening build
> > time may relieve the situation, and may not.
> > I'm slightly negative on disabling IT cases for PRs. The downside is
> > that we risk bugs in PRs that UTs don't catch, which may cost a lot more
> > to fix later and may slow down or even block others. But I'm open to
> > others' opinions on it.
> >
> > AFAICT from the INFRA ticket [1], donating to ASF INFRA won't be a
> > feasible way to solve our problem, since INFRA's pool is fully shared and
> > they have no control over, or finer insight into, resource allocation to
> > a specific Apache project. As mentioned in [1], Apache Arrow is moving
> > away from the ASF INFRA Travis pool (they are actually surprised Flink
> > hasn't planned to do so). I know that Spark is on its own build infra. If
> > we all agree on funding our own build infra, I'd be glad to help
> > investigate potential options after releasing 1.9, since I'm super busy
> > with 1.9 now.
> >
> > [1] https://issues.apache.org/jira/browse/INFRA-18533
> >
> >
> >
> > On Tue, Jul 2, 2019 at 4:46 AM Chesnay Schepler <[hidden email]> wrote:
> >
> >> As a short-term stopgap, since we can assume this issue to become much
> >> worse in the following days/weeks, we could disable IT cases in PRs and
> >> only run them on master.
> >>
> >> On 02/07/2019 12:03, Chesnay Schepler wrote:
> >>> People really have to stop thinking that just because something works
> >>> for us it is also a good solution.
> >>> Also, please remember that our builds run for 2h from start to finish,
> >>> and not the 14 _minutes_ it takes for Zeppelin.
> >>> We are dealing with an entirely different scale here, both in terms of
> >>> build times and number of builds.
> >>>
> >>> In this very thread people have been complaining about long queue
> >>> times for their builds. Surprise, other Apache projects have been
> >>> suffering the very same thing due to us not controlling our build
> >>> times.
> >>>> >>>>> > >>>> >>>>> Throwing hardware at the problem could help reduce the time > >>>> of an > >>>> >>>>> individual build, and using our own infrastructure would > >>>> remove our > >>>> >>>>> dependency on Apache's Travis account (with the obvious > >>>> downside of > >>>> >>>> having > >>>> >>>>> to maintain the infrastructure) > >>>> >>>>> We could use an open source travis alternative, to have a > >>>> similar > >>>> >>>>> experience and make the migration easy. > >>>> >>>>> > >>>> >>>>> > >>>> >>>>> On Wed, Jun 26, 2019 at 9:34 AM Chesnay Schepler > >>>> <[hidden email] <mailto:[hidden email]>> > >>>> >>>> wrote: > >>>> >>>>>> From what I gathered, there's no special sauce that the > >>>> Zeppelin > >>>> >>>>>> project uses which actually integrates a users Travis > >>>> account into the > >>>> >>>> PR. > >>>> >>>>>> They just disabled Travis for PRs. And that's kind of it. > >>>> >>>>>> > >>>> >>>>>> Naturally we can do this (duh) and safe the ASF a fair > >>>> amount of > >>>> >>>>>> resources, but there are downsides: > >>>> >>>>>> > >>>> >>>>>> The discoverability of the Travis check takes a nose-dive. > >>>> Either we > >>>> >>>>>> require every contributor to always, an every commit, also > >>>> post a > >>>> >> Travis > >>>> >>>>>> build, or we have the reviewer sift through the > >>>> contributors account > >>>> >> to > >>>> >>>>>> find it. > >>>> >>>>>> > >>>> >>>>>> This is rather cumbersome. Additionally, it's also not > >>>> equivalent to > >>>> >>>>>> having a PR build. > >>>> >>>>>> > >>>> >>>>>> A normal branch build takes a branch as is and tests it. A > >>>> PR build > >>>> >>>>>> merges the branch into master, and then runs it. (Fun > fact: > >>>> This is > >>>> >> why > >>>> >>>>>> a PR without merge conflicts is not being run on Travis.) > >>>> >>>>>> > >>>> >>>>>> And ultimately, everyone can already make use of this > >>>> approach anyway. 
> >>>> >>>>>> > >>>> >>>>>> On 25/06/2019 08:02, Jark Wu wrote: > >>>> >>>>>>> Hi Jeff, > >>>> >>>>>>> > >>>> >>>>>>> Thanks for sharing the Zeppelin approach. I think it's a > >>>> good idea to > >>>> >>>>>>> leverage user's travis account. > >>>> >>>>>>> In this way, we can have almost unlimited concurrent > build > >>>> jobs and > >>>> >>>>>>> developers can restart build by themselves (currently > only > >>>> committers > >>>> >>>>>>> can restart PR's build). > >>>> >>>>>>> > >>>> >>>>>>> But I'm still not very clear how to integrate user's > >>>> travis build > >>>> >> into > >>>> >>>>>>> the Flink pull request's build automatically. Can you > >>>> explain more in > >>>> >>>>>>> detail? > >>>> >>>>>>> > >>>> >>>>>>> Another question: does travis only build branches for > user > >>>> account? > >>>> >>>>>>> My concern is that builds for PRs will rebase user's > >>>> commits against > >>>> >>>>>>> current master branch. > >>>> >>>>>>> This will help us to find problems before merge. Builds > >>>> for branches > >>>> >>>>>>> will lose the impact of new commits in master. > >>>> >>>>>>> How does Zeppelin solve this problem? > >>>> >>>>>>> > >>>> >>>>>>> Thanks again for sharing the idea. > >>>> >>>>>>> > >>>> >>>>>>> Regards, > >>>> >>>>>>> Jark > >>>> >>>>>>> > >>>> >>>>>>> On Tue, 25 Jun 2019 at 11:01, Jeff Zhang < > [hidden email] > >>>> <mailto:[hidden email]> > >>>> >>>>>>> <mailto:[hidden email] <mailto:[hidden email]>>> > wrote: > >>>> >>>>>>> > >>>> >>>>>>> Hi Folks, > >>>> >>>>>>> > >>>> >>>>>>> Zeppelin meet this kind of issue before, we solve > >>>> it by > >>>> >> delegating > >>>> >>>>>>> each > >>>> >>>>>>> one's PR build to his travis account (Everyone can > >>>> have 5 free > >>>> >>>>>>> slot for > >>>> >>>>>>> travis build). > >>>> >>>>>>> Apache account travis build is only triggered when > >>>> PR is merged. 
> >>>> >>>>>>> > >>>> >>>>>>> > >>>> >>>>>>> > >>>> >>>>>>> Kurt Young <[hidden email] > >>>> <mailto:[hidden email]> <mailto:[hidden email] > >>>> <mailto:[hidden email]>>> > >>>> >>>>>>> 于2019年6月25日周二 上午10:16写道: > >>>> >>>>>>> > >>>> >>>>>>> > (Forgot to cc George) > >>>> >>>>>>> > > >>>> >>>>>>> > Best, > >>>> >>>>>>> > Kurt > >>>> >>>>>>> > > >>>> >>>>>>> > > >>>> >>>>>>> > On Tue, Jun 25, 2019 at 10:16 AM Kurt Young > >>>> <[hidden email] <mailto:[hidden email]> > >>>> >>>>>>> <mailto:[hidden email] <mailto:[hidden email]>>> > >>>> wrote: > >>>> >>>>>>> > > >>>> >>>>>>> > > Hi Bowen, > >>>> >>>>>>> > > > >>>> >>>>>>> > > Thanks for bringing this up. We actually have > >>>> discussed > >>>> >> about > >>>> >>>>>>> this, and I > >>>> >>>>>>> > > think Till and George have > >>>> >>>>>>> > > already spend sometime investigating it. I have > >>>> cced both of > >>>> >>>>>>> them, and > >>>> >>>>>>> > > maybe they can share > >>>> >>>>>>> > > their findings. > >>>> >>>>>>> > > > >>>> >>>>>>> > > Best, > >>>> >>>>>>> > > Kurt > >>>> >>>>>>> > > > >>>> >>>>>>> > > > >>>> >>>>>>> > > On Tue, Jun 25, 2019 at 10:08 AM Jark Wu > >>>> <[hidden email] <mailto:[hidden email]> > >>>> >>>>>>> <mailto:[hidden email] <mailto:[hidden email]>>> > >>>> wrote: > >>>> >>>>>>> > > > >>>> >>>>>>> > >> Hi Bowen, > >>>> >>>>>>> > >> > >>>> >>>>>>> > >> Thanks for bringing this. We also suffered > from > >>>> the long > >>>> >>>>>>> build time. > >>>> >>>>>>> > >> I agree that we should focus on solving build > >>>> capacity > >>>> >>>>>>> problem in the > >>>> >>>>>>> > >> thread. > >>>> >>>>>>> > >> > >>>> >>>>>>> > >> My observation is there is only one build is > >>>> running, all > >>>> >> the > >>>> >>>>>>> others > >>>> >>>>>>> > >> (other > >>>> >>>>>>> > >> PRs, master) are pending. > >>>> >>>>>>> > >> The pricing plan[1] of travis shows it can > >>>> support > >>>> >> concurrent > >>>> >>>>>>> build > >>>> >>>>>>> > jobs. 
> >>>> >>>>>>> > >> But I don't know which plan we are using, > might > >>>> be the free > >>>> >>>>>>> plan for > >>>> >>>>>>> > open > >>>> >>>>>>> > >> source. > >>>> >>>>>>> > >> > >>>> >>>>>>> > >> I cc-ed Chesnay who may have some experience > on > >>>> Travis. > >>>> >>>>>>> > >> > >>>> >>>>>>> > >> Regards, > >>>> >>>>>>> > >> Jark > >>>> >>>>>>> > >> > >>>> >>>>>>> > >> [1]: https://travis-ci.com/plans > >>>> >>>>>>> > >> > >>>> >>>>>>> > >> On Tue, 25 Jun 2019 at 08:11, Bowen Li < > >>>> >> [hidden email] <mailto:[hidden email]> > >>>> >>>>>>> <mailto:[hidden email] > >>>> <mailto:[hidden email]>>> wrote: > >>>> >>>>>>> > >> > >>>> >>>>>>> > >> > Hi Steven, > >>>> >>>>>>> > >> > > >>>> >>>>>>> > >> > I think you may not read what I wrote. The > >>>> discussion is > >>>> >>>> about > >>>> >>>>>>> > "unstable > >>>> >>>>>>> > >> > build **capacity**", in another word > >>>> "unstable / lack of > >>>> >>>> build > >>>> >>>>>>> > >> resources", > >>>> >>>>>>> > >> > not "unstable build". > >>>> >>>>>>> > >> > > >>>> >>>>>>> > >> > On Mon, Jun 24, 2019 at 4:40 PM Steven Wu > >>>> >>>>>>> <[hidden email] <mailto:[hidden email] > > > >>>> <mailto:[hidden email] <mailto:[hidden email]>>> > >>>> >>>>>>> > wrote: > >>>> >>>>>>> > >> > > >>>> >>>>>>> > >> > > long and sometimes unstable build is > >>>> definitely a pain > >>>> >>>>>> point. > >>>> >>>>>>> > >> > > > >>>> >>>>>>> > >> > > I suspect the build failure here in > >>>> >> flink-connector-kafka > >>>> >>>>>>> is not > >>>> >>>>>>> > >> related > >>>> >>>>>>> > >> > to > >>>> >>>>>>> > >> > > my change. but there is no easy re-run the > >>>> build on > >>>> >>>>>>> travis UI. > >>>> >>>>>>> > >> > > search showed a trick of close-and-open > the > >>>> PR will > >>>> >>>>>>> trigger rebuild. > >>>> >>>>>>> > >> but > >>>> >>>>>>> > >> > > that could add noises to the PR > activities. 
> >>>> >>>>>>> > >> > > > >>>> https://travis-ci.org/apache/flink/jobs/545555519 > >>>> >>>>>>> > >> > > > >>>> >>>>>>> > >> > > travis-ci for my personal repo often > failed > >>>> with > >>>> >>>>>>> exceeding time > >>>> >>>>>>> > limit > >>>> >>>>>>> > >> > after > >>>> >>>>>>> > >> > > 4+ hours. > >>>> >>>>>>> > >> > > The job exceeded the maximum time limit > for > >>>> jobs, and > >>>> >> has > >>>> >>>>>>> been > >>>> >>>>>>> > >> > terminated. > >>>> >>>>>>> > >> > > > >>>> >>>>>>> > >> > > On Mon, Jun 24, 2019 at 4:15 PM Bowen Li > >>>> >>>>>>> <[hidden email] <mailto:[hidden email]> > >>>> <mailto:[hidden email] <mailto:[hidden email]>>> > >>>> >>>>>>> > wrote: > >>>> >>>>>>> > >> > > > >>>> >>>>>>> > >> > > > > >>>> https://travis-ci.org/apache/flink/builds/549681530 > >>>> >>>>>>> This build > >>>> >>>>>>> > >> > request > >>>> >>>>>>> > >> > > > has > >>>> >>>>>>> > >> > > > been sitting at **HEAD of the queue** > >>>> since I first > >>>> >> saw > >>>> >>>>>>> it at PST > >>>> >>>>>>> > >> > 10:30am > >>>> >>>>>>> > >> > > > (not sure how long it's been there > before > >>>> 10:30am). > >>>> >>>>>>> It's PST > >>>> >>>>>>> > 4:12pm > >>>> >>>>>>> > >> now > >>>> >>>>>>> > >> > > and > >>>> >>>>>>> > >> > > > it hasn't started yet. > >>>> >>>>>>> > >> > > > > >>>> >>>>>>> > >> > > > On Mon, Jun 24, 2019 at 2:48 PM Bowen Li > >>>> >>>>>>> <[hidden email] <mailto:[hidden email]> > >>>> <mailto:[hidden email] <mailto:[hidden email]>>> > >>>> >>>>>>> > >> wrote: > >>>> >>>>>>> > >> > > > > >>>> >>>>>>> > >> > > > > Hi devs, > >>>> >>>>>>> > >> > > > > > >>>> >>>>>>> > >> > > > > I've been experiencing the pain > >>>> resulting from lack > >>>> >>>>>>> of stable > >>>> >>>>>>> > >> build > >>>> >>>>>>> > >> > > > > capacity on Travis for Flink PRs [1]. 
> >>>> >> Specifically, I > >>>> >>>>>>> noticed > >>>> >>>>>>> > >> often > >>>> >>>>>>> > >> > > that > >>>> >>>>>>> > >> > > > no > >>>> >>>>>>> > >> > > > > build in the queue is making any > >>>> progress for > >>>> >> hours, > >>>> >>>> and > >>>> >>>>>>> > suddenly > >>>> >>>>>>> > >> 5 > >>>> >>>>>>> > >> > or > >>>> >>>>>>> > >> > > 6 > >>>> >>>>>>> > >> > > > > builds kick off all together after the > >>>> long pause. > >>>> >>>>>>> I'm at PST > >>>> >>>>>>> > >> > (UTC-08) > >>>> >>>>>>> > >> > > > time > >>>> >>>>>>> > >> > > > > zone, and I've seen pause can be as > >>>> long as 6 hours > >>>> >>>>>>> from PST 9am > >>>> >>>>>>> > >> to > >>>> >>>>>>> > >> > 3pm > >>>> >>>>>>> > >> > > > > (let alone the time needed to drain > the > >>>> queue > >>>> >>>>>>> afterwards). > >>>> >>>>>>> > >> > > > > > >>>> >>>>>>> > >> > > > > I think this has greatly impacted our > >>>> productivity. > >>>> >>>> I've > >>>> >>>>>>> > >> experienced > >>>> >>>>>>> > >> > > that > >>>> >>>>>>> > >> > > > > PRs submitted in the early morning of > >>>> PST time zone > >>>> >>>>>>> won't finish > >>>> >>>>>>> > >> > their > >>>> >>>>>>> > >> > > > > build until late night of the same > day. > >>>> >>>>>>> > >> > > > > > >>>> >>>>>>> > >> > > > > So my questions are: > >>>> >>>>>>> > >> > > > > > >>>> >>>>>>> > >> > > > > - Has anyone else experienced the same > >>>> problem or > >>>> >>>>>>> have similar > >>>> >>>>>>> > >> > > > observation > >>>> >>>>>>> > >> > > > > on TravisCI? (I suspect it has things > >>>> to do with > >>>> >> time > >>>> >>>>>>> zone) > >>>> >>>>>>> > >> > > > > > >>>> >>>>>>> > >> > > > > - What pricing plan of TravisCI is > >>>> Flink currently > >>>> >>>>>>> using? Is it > >>>> >>>>>>> > >> the > >>>> >>>>>>> > >> > > free > >>>> >>>>>>> > >> > > > > plan for open source projects? What > >>>> are the > >>>> >>>>>>> guaranteed build > >>>> >>>>>>> > >> capacity > >>>> >>>>>>> > >> > > of > >>>> >>>>>>> > >> > > > > the current plan? 
> >>>> >>>>>>> > >> > > > > > >>>> >>>>>>> > >> > > > > - If the current pricing plan (either > >>>> free or paid) > >>>> >>>>>> can't > >>>> >>>>>>> > provide > >>>> >>>>>>> > >> > > stable > >>>> >>>>>>> > >> > > > > build capacity, can we upgrade to a > >>>> higher priced > >>>> >>>>>>> plan with > >>>> >>>>>>> > larger > >>>> >>>>>>> > >> > and > >>>> >>>>>>> > >> > > > more > >>>> >>>>>>> > >> > > > > stable build capacity? > >>>> >>>>>>> > >> > > > > > >>>> >>>>>>> > >> > > > > BTW, another factor that contribute to > >>>> the > >>>> >>>>>>> productivity problem > >>>> >>>>>>> > is > >>>> >>>>>>> > >> > that > >>>> >>>>>>> > >> > > > > our build is slow - we run full build > >>>> for every PR > >>>> >>>> and a > >>>> >>>>>>> > >> successful > >>>> >>>>>>> > >> > > full > >>>> >>>>>>> > >> > > > > build takes ~5h. We definitely have > >>>> more options to > >>>> >>>>>>> solve it, > >>>> >>>>>>> > for > >>>> >>>>>>> > >> > > > instance, > >>>> >>>>>>> > >> > > > > modularize the build graphs and reuse > >>>> artifacts > >>>> >> from > >>>> >>>> the > >>>> >>>>>>> > previous > >>>> >>>>>>> > >> > > build. > >>>> >>>>>>> > >> > > > > But I think that can be a big effort > >>>> which is much > >>>> >>>>>>> harder to > >>>> >>>>>>> > >> > accomplish > >>>> >>>>>>> > >> > > > in > >>>> >>>>>>> > >> > > > > a short period of time and may deserve > >>>> its own > >>>> >>>> separate > >>>> >>>>>>> > >> discussion. > >>>> >>>>>>> > >> > > > > > >>>> >>>>>>> > >> > > > > [1] > >>>> >> https://travis-ci.org/apache/flink/pull_requests > >>>> >>>>>>> > >> > > > > > >>>> >>>>>>> > >> > > > > > >>>> >>>>>>> > >> > > > > >>>> >>>>>>> > >> > > > >>>> >>>>>>> > >> > > >>>> >>>>>>> > >> > >>>> >>>>>>> > > > >>>> >>>>>>> > > >>>> >>>>>>> > >>>> >>>>>>> > >>>> >>>>>>> -- > >>>> >>>>>>> Best Regards > >>>> >>>>>>> > >>>> >>>>>>> Jeff Zhang > >>>> >>>>>>> > >>>> >> > >>>> > >>> > >> > > |
I've raised a JIRA <https://issues.apache.org/jira/browse/INFRA-18703>
with INFRA to inquire whether it would be possible to switch to a
different Travis account, and if so what steps would need to be taken.
We need a proper confirmation from INFRA since we are not in full
control of the flink repository (for example, we cannot access the
settings page).

If this is indeed possible, Ververica is willing to sponsor a Travis
account for the Flink project. This would provide us with more than
enough resources.

Since this makes the project more reliant on resources provided by
external companies, I would like to vote on this.

Please vote on this proposal, as follows:
[ ] +1, Approve the migration to a Ververica-sponsored Travis account,
provided that INFRA approves
[ ] -1, Do not approve the migration to a Ververica-sponsored Travis
account

The vote will be open for at least 24h, and until we have confirmation
from INFRA. The voting period may be shorter than the usual 3 days
since our current setup is effectively not working.

On 04/07/2019 06:51, Bowen Li wrote:
> Re: > Are they using their own Travis CI pool, or did they switch to an
> entirely different CI service?
>
> I reached out to Wes and Krisztián from Apache Arrow PMC. They are
> currently moving away from ASF's Travis to their own in-house metal
> machines at [1] with custom CI application at [2]. They've seen
> significant improvement w.r.t. both much higher performance and
> basically no resource waiting time, "night-and-day" difference quoting
> Wes.
>
> Re: > If we can just switch to our own Travis pool, just for our
> project, then this might be something we can do fairly quickly?
>
> I believe so, according to [3] and [4]
>
>
> [1] https://ci.ursalabs.org/
> [2] https://github.com/ursa-labs/ursabot
> [3] https://docs.travis-ci.com/user/migrate/open-source-repository-migration
> [4] https://docs.travis-ci.com/user/migrate/open-source-on-travis-ci-com
>
>
> On Wed, Jul 3, 2019 at 12:01 AM Chesnay Schepler <[hidden email]> wrote:
>
>     Are they using their own Travis CI pool, or did they switch to an
>     entirely different CI service?
>
>     If we can just switch to our own Travis pool, just for our
>     project, then this might be something we can do fairly quickly?
>
>     On 03/07/2019 05:55, Bowen Li wrote:
>     > I responded in the INFRA ticket [1] that I believe they are
>     > using a wrong metric against Flink and the total build time is a
>     > completely different thing than guaranteed build capacity.
>     >
>     > My response:
>     >
>     > "As mentioned above, since I started to pay attention to Flink's
>     > build queue a few tens of days ago, I'm in Seattle and I saw no
>     > build was kicking off in PST daytime in weekdays for Flink. Our
>     > teammates in China and Europe have also reported similar
>     > observations. So we need to evaluate where the large total build
>     > time came from - if 1) your number and 2) our observations from
>     > three locations that cover pretty much a full day, are all true,
>     > I **guess** one reason can be that - highly likely the extra
>     > build time came from weekends when other Apache projects may be
>     > idle and Flink just drains hard its congested queue.
>     >
>     > Please be aware that we're not complaining about the lack of
>     > resources in general, I'm complaining about the lack of
>     > **stable, dedicated** resources. An example for the latter one
>     > is, currently even if no build is in Flink's queue and I submit
>     > a request to be the queue head in PST morning, my build won't
>     > even start in 6-8+h. That is an absurd amount of waiting time.
>     >
>     > That's saying, if ASF INFRA decides to adopt a quota system and
>     > grants Flink five DEDICATED servers that run all the time only
>     > for Flink, that'll be PERFECT and can totally solve our problem
>     > now.
>     >
>     > I feel what's missing in the ASF INFRA's Travis resource pool is
>     > some level of build capacity SLAs and certainty"
>     >
>     >
>     > Again, I believe there are differences in nature of these two
>     > problems, long build time v.s. lack of dedicated build resource.
>     > That's saying, shortening build time may relieve the situation,
>     > and may not. I'm slightly negative on disabling IT cases for
>     > PRs, because we would be at risk of potential bugs in PRs that
>     > UTs don't catch, which may cost a lot more to fix later and may
>     > slow down or even block others, but I am open to others'
>     > opinions on it.
>     >
>     > AFAICT from INFRA ticket [1], donating to ASF INFRA won't be
>     > feasible to solve our problem since INFRA's pool is fully shared
>     > and they have no control and finer insights over resource
>     > allocation to a specific Apache project. As mentioned in [1],
>     > Apache Arrow is moving away from ASF INFRA Travis pool (they are
>     > actually surprised Flink hasn't planned to do so). I know that
>     > Spark is on its own build infra.
>     > If we all agree on funding our own build infra, I'd be glad to
>     > help investigate any potential options after releasing 1.9,
>     > since I'm super busy with 1.9 now.
>     >
>     > [1] https://issues.apache.org/jira/browse/INFRA-18533
>     >
>     >
>     > On Tue, Jul 2, 2019 at 4:46 AM Chesnay Schepler <[hidden email]> wrote:
>     >
>     >> As a short-term stopgap, since we can assume this issue to
>     >> become much worse in the following days/weeks, we could disable
>     >> IT cases in PRs and only run them on master.
>     >>
>     >> On 02/07/2019 12:03, Chesnay Schepler wrote:
>     >>> People really have to stop thinking that just because
>     >>> something works for us it is also a good solution.
>     >>> Also, please remember that our builds run for 2h from start to
>     >>> finish, and not the 14 _minutes_ it takes for zeppelin.
>     >>> We are dealing with an entirely different scale here, both in
>     >>> terms of build times and number of builds.
>     >>>
>     >>> In this very thread people have been complaining about long
>     >>> queue times for their builds. Surprise, other Apache projects
>     >>> have been suffering the very same thing due to us not
>     >>> controlling our build times. While switching services (be it
>     >>> Jenkins, CircleCI or whatever) will possibly work for us (and
>     >>> these options are actually attractive, like CircleCI's proper
>     >>> support for build artifacts), it will also result in us likely
>     >>> negatively affecting other projects in significant ways.
>     >>>
>     >>> Sure, the Jenkins setup has a good user experience for us, at
>     >>> the cost of blocking Jenkins workers for a _lot_ of time.
>     >>> Right now we have 25 PRs in our queue; that's possibly 50h
>     >>> we'd consume of Jenkins resources, and the European
>     >>> contributors haven't even really started yet.
>     >>>
>     >>> FYI, the latest INFRA response from INFRA-18533:
>     >>>
>     >>> "Our rough metrics shows that Flink used over 5800 hours of
>     >>> build time last month.
>     >>> That is equal to EIGHT servers running 24/7 for the ENTIRE
>     >>> MONTH. EIGHT. nonstop.
>     >>> When we discovered this last night, we discussed it some and
>     >>> are going to tune down Flink to allow only five executors
>     >>> maximum. We cannot allow Flink to consume so much of a
>     >>> Foundation shared resource."
>     >>>
>     >>> So yes, we either
>     >>> a) have to heavily reduce our CI usage or
>     >>> b) fund our own, either maintaining it ourselves or donating
>     >>> to Apache.
> >>>> >>>>> > >>>> >>>>> Throwing hardware at the problem could help reduce > the time > >>>> of an > >>>> >>>>> individual build, and using our own infrastructure > would > >>>> remove our > >>>> >>>>> dependency on Apache's Travis account (with the > obvious > >>>> downside of > >>>> >>>> having > >>>> >>>>> to maintain the infrastructure) > >>>> >>>>> We could use an open source travis alternative, to > have a > >>>> similar > >>>> >>>>> experience and make the migration easy. > >>>> >>>>> > >>>> >>>>> > >>>> >>>>> On Wed, Jun 26, 2019 at 9:34 AM Chesnay Schepler > >>>> <[hidden email] <mailto:[hidden email]> > <mailto:[hidden email] <mailto:[hidden email]>>> > >>>> >>>> wrote: > >>>> >>>>>> >From what I gathered, there's no special > sauce that the > >>>> Zeppelin > >>>> >>>>>> project uses which actually integrates a users Travis > >>>> account into the > >>>> >>>> PR. > >>>> >>>>>> They just disabled Travis for PRs. And that's > kind of it. > >>>> >>>>>> > >>>> >>>>>> Naturally we can do this (duh) and safe the ASF a > fair > >>>> amount of > >>>> >>>>>> resources, but there are downsides: > >>>> >>>>>> > >>>> >>>>>> The discoverability of the Travis check takes a > nose-dive. > >>>> Either we > >>>> >>>>>> require every contributor to always, an every > commit, also > >>>> post a > >>>> >> Travis > >>>> >>>>>> build, or we have the reviewer sift through the > >>>> contributors account > >>>> >> to > >>>> >>>>>> find it. > >>>> >>>>>> > >>>> >>>>>> This is rather cumbersome. Additionally, it's > also not > >>>> equivalent to > >>>> >>>>>> having a PR build. > >>>> >>>>>> > >>>> >>>>>> A normal branch build takes a branch as is and > tests it. A > >>>> PR build > >>>> >>>>>> merges the branch into master, and then runs it. > (Fun fact: > >>>> This is > >>>> >> why > >>>> >>>>>> a PR without merge conflicts is not being run on > Travis.) > >>>> >>>>>> > >>>> >>>>>> And ultimately, everyone can already make use of this > >>>> approach anyway. 
> >>>> >>>>>> > >>>> >>>>>> On 25/06/2019 08:02, Jark Wu wrote: > >>>> >>>>>>> Hi Jeff, > >>>> >>>>>>> > >>>> >>>>>>> Thanks for sharing the Zeppelin approach. I > think it's a > >>>> good idea to > >>>> >>>>>>> leverage user's travis account. > >>>> >>>>>>> In this way, we can have almost unlimited > concurrent build > >>>> jobs and > >>>> >>>>>>> developers can restart build by themselves > (currently only > >>>> committers > >>>> >>>>>>> can restart PR's build). > >>>> >>>>>>> > >>>> >>>>>>> But I'm still not very clear how to integrate user's > >>>> travis build > >>>> >> into > >>>> >>>>>>> the Flink pull request's build automatically. > Can you > >>>> explain more in > >>>> >>>>>>> detail? > >>>> >>>>>>> > >>>> >>>>>>> Another question: does travis only build > branches for user > >>>> account? > >>>> >>>>>>> My concern is that builds for PRs will rebase user's > >>>> commits against > >>>> >>>>>>> current master branch. > >>>> >>>>>>> This will help us to find problems before > merge. Builds > >>>> for branches > >>>> >>>>>>> will lose the impact of new commits in master. > >>>> >>>>>>> How does Zeppelin solve this problem? > >>>> >>>>>>> > >>>> >>>>>>> Thanks again for sharing the idea. > >>>> >>>>>>> > >>>> >>>>>>> Regards, > >>>> >>>>>>> Jark > >>>> >>>>>>> > >>>> >>>>>>> On Tue, 25 Jun 2019 at 11:01, Jeff Zhang > <[hidden email] <mailto:[hidden email]> > >>>> <mailto:[hidden email] <mailto:[hidden email]>> > >>>> >>>>>>> <mailto:[hidden email] > <mailto:[hidden email]> <mailto:[hidden email] > <mailto:[hidden email]>>>> wrote: > >>>> >>>>>>> > >>>> >>>>>>> Hi Folks, > >>>> >>>>>>> > >>>> >>>>>>> Zeppelin meet this kind of issue before, we solve > >>>> it by > >>>> >> delegating > >>>> >>>>>>> each > >>>> >>>>>>> one's PR build to his travis account > (Everyone can > >>>> have 5 free > >>>> >>>>>>> slot for > >>>> >>>>>>> travis build). > >>>> >>>>>>> Apache account travis build is only triggered when > >>>> PR is merged. 
> >>>> >>>>>>> > >>>> >>>>>>> > >>>> >>>>>>> > >>>> >>>>>>> Kurt Young <[hidden email] > <mailto:[hidden email]> > >>>> <mailto:[hidden email] <mailto:[hidden email]>> > <mailto:[hidden email] <mailto:[hidden email]> > >>>> <mailto:[hidden email] <mailto:[hidden email]>>>> > >>>> >>>>>>> 于2019年6月25日周二 上午10:16写道: > >>>> >>>>>>> > >>>> >>>>>>> > (Forgot to cc George) > >>>> >>>>>>> > > >>>> >>>>>>> > Best, > >>>> >>>>>>> > Kurt > >>>> >>>>>>> > > >>>> >>>>>>> > > >>>> >>>>>>> > On Tue, Jun 25, 2019 at 10:16 AM Kurt Young > >>>> <[hidden email] <mailto:[hidden email]> > <mailto:[hidden email] <mailto:[hidden email]>> > >>>> >>>>>>> <mailto:[hidden email] > <mailto:[hidden email]> <mailto:[hidden email] > <mailto:[hidden email]>>>> > >>>> wrote: > >>>> >>>>>>> > > >>>> >>>>>>> > > Hi Bowen, > >>>> >>>>>>> > > > >>>> >>>>>>> > > Thanks for bringing this up. We > actually have > >>>> discussed > >>>> >> about > >>>> >>>>>>> this, and I > >>>> >>>>>>> > > think Till and George have > >>>> >>>>>>> > > already spend sometime investigating > it. I have > >>>> cced both of > >>>> >>>>>>> them, and > >>>> >>>>>>> > > maybe they can share > >>>> >>>>>>> > > their findings. > >>>> >>>>>>> > > > >>>> >>>>>>> > > Best, > >>>> >>>>>>> > > Kurt > >>>> >>>>>>> > > > >>>> >>>>>>> > > > >>>> >>>>>>> > > On Tue, Jun 25, 2019 at 10:08 AM Jark Wu > >>>> <[hidden email] <mailto:[hidden email]> > <mailto:[hidden email] <mailto:[hidden email]>> > >>>> >>>>>>> <mailto:[hidden email] > <mailto:[hidden email]> <mailto:[hidden email] > <mailto:[hidden email]>>>> > >>>> wrote: > >>>> >>>>>>> > > > >>>> >>>>>>> > >> Hi Bowen, > >>>> >>>>>>> > >> > >>>> >>>>>>> > >> Thanks for bringing this. We also > suffered from > >>>> the long > >>>> >>>>>>> build time. > >>>> >>>>>>> > >> I agree that we should focus on > solving build > >>>> capacity > >>>> >>>>>>> problem in the > >>>> >>>>>>> > >> thread. 
> >>>> >>>>>>> > >> > >>>> >>>>>>> > >> My observation is there is only one > build is > >>>> running, all > >>>> >> the > >>>> >>>>>>> others > >>>> >>>>>>> > >> (other > >>>> >>>>>>> > >> PRs, master) are pending. > >>>> >>>>>>> > >> The pricing plan[1] of travis shows > it can > >>>> support > >>>> >> concurrent > >>>> >>>>>>> build > >>>> >>>>>>> > jobs. > >>>> >>>>>>> > >> But I don't know which plan we are > using, might > >>>> be the free > >>>> >>>>>>> plan for > >>>> >>>>>>> > open > >>>> >>>>>>> > >> source. > >>>> >>>>>>> > >> > >>>> >>>>>>> > >> I cc-ed Chesnay who may have some > experience on > >>>> Travis. > >>>> >>>>>>> > >> > >>>> >>>>>>> > >> Regards, > >>>> >>>>>>> > >> Jark > >>>> >>>>>>> > >> > >>>> >>>>>>> > >> [1]: https://travis-ci.com/plans > >>>> >>>>>>> > >> > >>>> >>>>>>> > >> On Tue, 25 Jun 2019 at 08:11, Bowen Li < > >>>> >> [hidden email] <mailto:[hidden email]> > <mailto:[hidden email] <mailto:[hidden email]>> > >>>> >>>>>>> <mailto:[hidden email] > <mailto:[hidden email]> > >>>> <mailto:[hidden email] > <mailto:[hidden email]>>>> wrote: > >>>> >>>>>>> > >> > >>>> >>>>>>> > >> > Hi Steven, > >>>> >>>>>>> > >> > > >>>> >>>>>>> > >> > I think you may not read what I > wrote. The > >>>> discussion is > >>>> >>>> about > >>>> >>>>>>> > "unstable > >>>> >>>>>>> > >> > build **capacity**", in another word > >>>> "unstable / lack of > >>>> >>>> build > >>>> >>>>>>> > >> resources", > >>>> >>>>>>> > >> > not "unstable build". > >>>> >>>>>>> > >> > > >>>> >>>>>>> > >> > On Mon, Jun 24, 2019 at 4:40 PM > Steven Wu > >>>> >>>>>>> <[hidden email] > <mailto:[hidden email]> <mailto:[hidden email] > <mailto:[hidden email]>> > >>>> <mailto:[hidden email] > <mailto:[hidden email]> <mailto:[hidden email] > <mailto:[hidden email]>>>> > >>>> >>>>>>> > wrote: > >>>> >>>>>>> > >> > > >>>> >>>>>>> > >> > > long and sometimes unstable build is > >>>> definitely a pain > >>>> >>>>>> point. 
> >>>> >>>>>>> > >> > > > >>>> >>>>>>> > >> > > I suspect the build failure here in > >>>> >> flink-connector-kafka > >>>> >>>>>>> is not > >>>> >>>>>>> > >> related > >>>> >>>>>>> > >> > to > >>>> >>>>>>> > >> > > my change. but there is no easy > re-run the > >>>> build on > >>>> >>>>>>> travis UI. > >>>> >>>>>>> > >> > > search showed a trick of > close-and-open the > >>>> PR will > >>>> >>>>>>> trigger rebuild. > >>>> >>>>>>> > >> but > >>>> >>>>>>> > >> > > that could add noises to the PR > activities. > >>>> >>>>>>> > >> > > > >>>> https://travis-ci.org/apache/flink/jobs/545555519 > >>>> >>>>>>> > >> > > > >>>> >>>>>>> > >> > > travis-ci for my personal repo > often failed > >>>> with > >>>> >>>>>>> exceeding time > >>>> >>>>>>> > limit > >>>> >>>>>>> > >> > after > >>>> >>>>>>> > >> > > 4+ hours. > >>>> >>>>>>> > >> > > The job exceeded the maximum time > limit for > >>>> jobs, and > >>>> >> has > >>>> >>>>>>> been > >>>> >>>>>>> > >> > terminated. > >>>> >>>>>>> > >> > > > >>>> >>>>>>> > >> > > On Mon, Jun 24, 2019 at 4:15 PM > Bowen Li > >>>> >>>>>>> <[hidden email] > <mailto:[hidden email]> <mailto:[hidden email] > <mailto:[hidden email]>> > >>>> <mailto:[hidden email] <mailto:[hidden email]> > <mailto:[hidden email] <mailto:[hidden email]>>>> > >>>> >>>>>>> > wrote: > >>>> >>>>>>> > >> > > > >>>> >>>>>>> > >> > > > > >>>> https://travis-ci.org/apache/flink/builds/549681530 > >>>> >>>>>>> This build > >>>> >>>>>>> > >> > request > >>>> >>>>>>> > >> > > > has > >>>> >>>>>>> > >> > > > been sitting at **HEAD of the > queue** > >>>> since I first > >>>> >> saw > >>>> >>>>>>> it at PST > >>>> >>>>>>> > >> > 10:30am > >>>> >>>>>>> > >> > > > (not sure how long it's been > there before > >>>> 10:30am). > >>>> >>>>>>> It's PST > >>>> >>>>>>> > 4:12pm > >>>> >>>>>>> > >> now > >>>> >>>>>>> > >> > > and > >>>> >>>>>>> > >> > > > it hasn't started yet. 
> >>>> >>>>>>> > >> > > > > >>>> >>>>>>> > >> > > > On Mon, Jun 24, 2019 at 2:48 PM > Bowen Li > >>>> >>>>>>> <[hidden email] > <mailto:[hidden email]> <mailto:[hidden email] > <mailto:[hidden email]>> > >>>> <mailto:[hidden email] <mailto:[hidden email]> > <mailto:[hidden email] <mailto:[hidden email]>>>> > >>>> >>>>>>> > >> wrote: > >>>> >>>>>>> > >> > > > > >>>> >>>>>>> > >> > > > > Hi devs, > >>>> >>>>>>> > >> > > > > > >>>> >>>>>>> > >> > > > > I've been experiencing the pain > >>>> resulting from lack > >>>> >>>>>>> of stable > >>>> >>>>>>> > >> build > >>>> >>>>>>> > >> > > > > capacity on Travis for Flink > PRs [1]. > >>>> >> Specifically, I > >>>> >>>>>>> noticed > >>>> >>>>>>> > >> often > >>>> >>>>>>> > >> > > that > >>>> >>>>>>> > >> > > > no > >>>> >>>>>>> > >> > > > > build in the queue is making any > >>>> progress for > >>>> >> hours, > >>>> >>>> and > >>>> >>>>>>> > suddenly > >>>> >>>>>>> > >> 5 > >>>> >>>>>>> > >> > or > >>>> >>>>>>> > >> > > 6 > >>>> >>>>>>> > >> > > > > builds kick off all together > after the > >>>> long pause. > >>>> >>>>>>> I'm at PST > >>>> >>>>>>> > >> > (UTC-08) > >>>> >>>>>>> > >> > > > time > >>>> >>>>>>> > >> > > > > zone, and I've seen pause can > be as > >>>> long as 6 hours > >>>> >>>>>>> from PST 9am > >>>> >>>>>>> > >> to > >>>> >>>>>>> > >> > 3pm > >>>> >>>>>>> > >> > > > > (let alone the time needed to > drain the > >>>> queue > >>>> >>>>>>> afterwards). > >>>> >>>>>>> > >> > > > > > >>>> >>>>>>> > >> > > > > I think this has greatly > impacted our > >>>> productivity. > >>>> >>>> I've > >>>> >>>>>>> > >> experienced > >>>> >>>>>>> > >> > > that > >>>> >>>>>>> > >> > > > > PRs submitted in the early > morning of > >>>> PST time zone > >>>> >>>>>>> won't finish > >>>> >>>>>>> > >> > their > >>>> >>>>>>> > >> > > > > build until late night of the > same day. 
> >>>> >>>>>>> > >> > > > > > >>>> >>>>>>> > >> > > > > So my questions are: > >>>> >>>>>>> > >> > > > > > >>>> >>>>>>> > >> > > > > - Has anyone else experienced > the same > >>>> problem or > >>>> >>>>>>> have similar > >>>> >>>>>>> > >> > > > observation > >>>> >>>>>>> > >> > > > > on TravisCI? (I suspect it > has things > >>>> to do with > >>>> >> time > >>>> >>>>>>> zone) > >>>> >>>>>>> > >> > > > > > >>>> >>>>>>> > >> > > > > - What pricing plan of > TravisCI is > >>>> Flink currently > >>>> >>>>>>> using? Is it > >>>> >>>>>>> > >> the > >>>> >>>>>>> > >> > > free > >>>> >>>>>>> > >> > > > > plan for open source > projects? What > >>>> are the > >>>> >>>>>>> guaranteed build > >>>> >>>>>>> > >> capacity > >>>> >>>>>>> > >> > > of > >>>> >>>>>>> > >> > > > > the current plan? > >>>> >>>>>>> > >> > > > > > >>>> >>>>>>> > >> > > > > - If the current pricing plan > (either > >>>> free or paid) > >>>> >>>>>> can't > >>>> >>>>>>> > provide > >>>> >>>>>>> > >> > > stable > >>>> >>>>>>> > >> > > > > build capacity, can we > upgrade to a > >>>> higher priced > >>>> >>>>>>> plan with > >>>> >>>>>>> > larger > >>>> >>>>>>> > >> > and > >>>> >>>>>>> > >> > > > more > >>>> >>>>>>> > >> > > > > stable build capacity? > >>>> >>>>>>> > >> > > > > > >>>> >>>>>>> > >> > > > > BTW, another factor that > contribute to > >>>> the > >>>> >>>>>>> productivity problem > >>>> >>>>>>> > is > >>>> >>>>>>> > >> > that > >>>> >>>>>>> > >> > > > > our build is slow - we run > full build > >>>> for every PR > >>>> >>>> and a > >>>> >>>>>>> > >> successful > >>>> >>>>>>> > >> > > full > >>>> >>>>>>> > >> > > > > build takes ~5h. We > definitely have > >>>> more options to > >>>> >>>>>>> solve it, > >>>> >>>>>>> > for > >>>> >>>>>>> > >> > > > instance, > >>>> >>>>>>> > >> > > > > modularize the build graphs > and reuse > >>>> artifacts > >>>> >> from > >>>> >>>> the > >>>> >>>>>>> > previous > >>>> >>>>>>> > >> > > build. 
> >>>> >>>>>>> > >> > > > > But I think that can be a big > effort > >>>> which is much > >>>> >>>>>>> harder to > >>>> >>>>>>> > >> > accomplish > >>>> >>>>>>> > >> > > > in > >>>> >>>>>>> > >> > > > > a short period of time and > may deserve > >>>> its own > >>>> >>>> separate > >>>> >>>>>>> > >> discussion. > >>>> >>>>>>> > >> > > > > > >>>> >>>>>>> > >> > > > > [1] > >>>> >> https://travis-ci.org/apache/flink/pull_requests > >>>> >>>>>>> > >> > > > > > >>>> >>>>>>> > >> > > > > > >>>> >>>>>>> > >> > > > > >>>> >>>>>>> > >> > > > >>>> >>>>>>> > >> > > >>>> >>>>>>> > >> > >>>> >>>>>>> > > > >>>> >>>>>>> > > >>>> >>>>>>> > >>>> >>>>>>> > >>>> >>>>>>> -- > >>>> >>>>>>> Best Regards > >>>> >>>>>>> > >>>> >>>>>>> Jeff Zhang > >>>> >>>>>>> > >>>> >> > >>>> > >>> > >> > |
+1 to move to a private Travis account.

I can confirm that Ververica will sponsor a Travis CI plan that is equivalent to or a bit higher than the previous ASF quota (10 concurrent build queues).

Best,
Stephan

On Thu, Jul 4, 2019 at 10:46 AM Chesnay Schepler <[hidden email]> wrote:

> I've raised a JIRA <https://issues.apache.org/jira/browse/INFRA-18703> with
> INFRA to inquire whether it would be possible to switch to a different
> Travis account, and if so, what steps would need to be taken.
> We need a proper confirmation from INFRA since we are not in full
> control of the flink repository (for example, we cannot access the
> settings page).
>
> If this is indeed possible, Ververica is willing to sponsor a Travis
> account for the Flink project.
> This would provide us with more resources than we need.
>
> Since this makes the project more reliant on resources provided by
> external companies, I would like to vote on this.
>
> Please vote on this proposal, as follows:
> [ ] +1, Approve the migration to a Ververica-sponsored Travis account,
>     provided that INFRA approves
> [ ] -1, Do not approve the migration to a Ververica-sponsored Travis
>     account
>
> The vote will be open for at least 24h, and until we have confirmation
> from INFRA. The voting period may be shorter than the usual 3 days since
> our current setup is effectively not working.
>
> On 04/07/2019 06:51, Bowen Li wrote:
> > Re: > Are they using their own Travis CI pool, or did they switch to an
> > entirely different CI service?
> >
> > I reached out to Wes and Krisztián from the Apache Arrow PMC. They are
> > currently moving away from ASF's Travis to their own in-house metal
> > machines at [1] with a custom CI application at [2]. They've seen
> > significant improvement, with much higher performance and basically no
> > resource waiting time: a "night-and-day" difference, quoting Wes.
> >
> > Re: > If we can just switch to our own Travis pool, just for our
> > project, then this might be something we can do fairly quickly?
> >
> > I believe so, according to [3] and [4].
> >
> > [1] https://ci.ursalabs.org/
> > [2] https://github.com/ursa-labs/ursabot
> > [3] https://docs.travis-ci.com/user/migrate/open-source-repository-migration
> > [4] https://docs.travis-ci.com/user/migrate/open-source-on-travis-ci-com
> >
> > On Wed, Jul 3, 2019 at 12:01 AM Chesnay Schepler <[hidden email]> wrote:
> > > Are they using their own Travis CI pool, or did they switch to an
> > > entirely different CI service?
> > >
> > > If we can just switch to our own Travis pool, just for our project,
> > > then this might be something we can do fairly quickly?
> > >
> > > On 03/07/2019 05:55, Bowen Li wrote:
> > > > I responded in the INFRA ticket [1] that I believe they are using the
> > > > wrong metric against Flink, and that total build time is a completely
> > > > different thing from guaranteed build capacity.
> > > >
> > > > My response:
> > > >
> > > > "As mentioned above, I'm in Seattle, and since I started to pay
> > > > attention to Flink's build queue a few tens of days ago, I have seen
> > > > no build kick off for Flink during PST daytime on weekdays. Our
> > > > teammates in China and Europe have also reported similar
> > > > observations. So we need to evaluate where the large total build time
> > > > came from. If 1) your number and 2) our observations from three
> > > > locations that cover pretty much a full day are all true, I **guess**
> > > > one reason can be that the extra build time highly likely came from
> > > > weekends, when other Apache projects may be idle and Flink just
> > > > drains its congested queue.
> > > >
> > > > Please be aware that we're not complaining about the lack of
> > > > resources in general; I'm complaining about the lack of **stable,
> > > > dedicated** resources.
> > > > An example of the latter: currently, even if no build is in Flink's
> > > > queue and I submit a request that becomes the queue head in the PST
> > > > morning, my build won't start for 6-8+ hours. That is an absurd
> > > > amount of waiting time.
> > > >
> > > > That is to say, if ASF INFRA decides to adopt a quota system and
> > > > grants Flink five DEDICATED servers that run all the time only for
> > > > Flink, that'll be PERFECT and can totally solve our problem now.
> > > >
> > > > I feel what's missing in ASF INFRA's Travis resource pool is some
> > > > level of build capacity SLAs and certainty."
> > > >
> > > > Again, I believe these two problems differ in nature: long build time
> > > > vs. lack of dedicated build resources. That is to say, shortening the
> > > > build time may or may not relieve the situation. I'm slightly
> > > > negative on disabling IT cases for PRs: the downside is that we risk
> > > > bugs in PRs that UTs don't catch, which may cost a lot more to fix
> > > > later, especially if they slow down or even block others. But I'm
> > > > open to others' opinions on it.
> > > >
> > > > AFAICT from the INFRA ticket [1], donating to ASF INFRA won't solve
> > > > our problem, since INFRA's pool is fully shared and they have no
> > > > control over, or finer insight into, resource allocation to a
> > > > specific Apache project. As mentioned in [1], Apache Arrow is moving
> > > > away from the ASF INFRA Travis pool (they are actually surprised
> > > > Flink hasn't planned to do so). I know that Spark is on its own build
> > > > infra. If we all agree on funding our own build infra, I'd be glad to
> > > > help investigate potential options after releasing 1.9, since I'm
> > > > super busy with 1.9 now.
> > > >
> > > > [1] https://issues.apache.org/jira/browse/INFRA-18533
> > > >
> > > > On Tue, Jul 2, 2019 at 4:46 AM Chesnay Schepler <[hidden email]> wrote:
> > > > > As a short-term stopgap, since we can assume this issue will become
> > > > > much worse in the following days/weeks, we could disable IT cases
> > > > > in PRs and only run them on master.
> > > > >
> > > > > On 02/07/2019 12:03, Chesnay Schepler wrote:
> > > > > > People really have to stop thinking that just because something
> > > > > > works for us it is also a good solution.
> > > > > > Also, please remember that our builds run for 2h from start to
> > > > > > finish, and not the 14 _minutes_ it takes for Zeppelin.
> > > > > > We are dealing with an entirely different scale here, both in
> > > > > > terms of build times and number of builds.
> > > > > >
> > > > > > In this very thread people have been complaining about long queue
> > > > > > times for their builds. Surprise, other Apache projects have been
> > > > > > suffering the very same thing due to us not controlling our build
> > > > > > times.
> > > > > > While switching services (be it Jenkins, CircleCI or whatever)
> > > > > > will possibly work for us (and these options are actually
> > > > > > attractive, like CircleCI's proper support for build artifacts),
> > > > > > it will also likely result in us negatively affecting other
> > > > > > projects in significant ways.
> > > > > >
> > > > > > Sure, the Jenkins setup has a good user experience for us, at the
> > > > > > cost of blocking Jenkins workers for a _lot_ of time. Right now
> > > > > > we have 25 PRs in our queue; that's possibly 50h we'd consume of
> > > > > > Jenkins resources, and the European contributors haven't even
> > > > > > really started yet.
> > > > > >
> > > > > > FYI, the latest INFRA response from INFRA-18533:
> > > > > >
> > > > > > "Our rough metrics shows that Flink used over 5800 hours of build
> > > > > > time last month. That is equal to EIGHT servers running 24/7 for
> > > > > > the ENTIRE MONTH. EIGHT. nonstop.
> > > > > > When we discovered this last night, we discussed it some and are
> > > > > > going to tune down Flink to allow only five executors maximum. We
> > > > > > cannot allow Flink to consume so much of a Foundation shared
> > > > > > resource."
> > > > > >
> > > > > > So yes, we either
> > > > > > a) have to heavily reduce our CI usage or
> > > > > > b) fund our own, either maintaining it ourselves or donating to
> > > > > >    Apache.
> > > > > >
> > > > > > On 02/07/2019 05:11, Bowen Li wrote:
> > > > > > > By looking at the git history of the Jenkins script, its core
> > > > > > > part was finished in March 2017 (with only two minor updates in
> > > > > > > 2017/2018), so it's been running for over two years now, and it
> > > > > > > feels like the Zeppelin community has been quite happy with it.
> > > > > > > @Jeff Zhang, can you share your insights and user experience
> > > > > > > with the Jenkins+Travis approach?
> > > > > > >
> > > > > > > Things like:
> > > > > > >
> > > > > > > - has the approach completely solved the resource capacity
> > > > > > >   problem for the Zeppelin community?
> > >>>> >>> > > >>>> >>> @Chesnay @Robert I synced with Jeff offline that > > Zeppelin > > >>>> community is > > >>>> >>> using a Jenkins job to automatically build on users' > > travis > > >>>> account and > > >>>> >>> link the result back to github PR. I guess the > > Jenkins job > > >>>> would fetch > > >>>> >>> latest upstream master and build the PR against it. > > Jeff has > > >>>> filed > > >>>> >> tickets > > >>>> >>> to learn and get access to the Jenkins infra. It'll > > better to > > >>>> fully > > >>>> >>> understand it first before judging this approach. > > >>>> >>> > > >>>> >>> I also heard good things about CircleCI, and ASF > > INFRA seems > > >>>> to have a > > >>>> >> pool > > >>>> >>> of build capacity there too. Can be an alternative > > to consider. > > >>>> >>> > > >>>> >>> > > >>>> >>> > > >>>> >>> > > >>>> >>> > > >>>> >>> > > >>>> >>> > > >>>> >>> > > >>>> >>> > > >>>> >>> On Wed, Jun 26, 2019 at 12:44 AM Dawid Wysakowicz < > > >>>> >> [hidden email] > > <mailto:[hidden email]> <mailto:[hidden email] > > <mailto:[hidden email]>>> > > >>>> >>> wrote: > > >>>> >>> > > >>>> >>>> Sorry to jump in late, but I think Bowen missed the > > most > > >>>> important point > > >>>> >>>> from Chesnay's previous message in the summary. The > > ultimate > > >>>> reason for > > >>>> >>>> all the problems is that the tests take close to 2 > > hours to > > >>>> run already. > > >>>> >>>> I fully support this claim: "Unless people start > > caring about > > >>>> test times > > >>>> >>>> before adding them, this issue cannot be solved" > > >>>> >>>> > > >>>> >>>> This is also another reason why using user's Travis > > account > > >>>> won't help. > > >>>> >>>> Every few weeks we reach the user's time limit for > > a single > > >>>> profile. 
> > >>>> >>>> This makes the user's builds simply fail, until we > > either > > >>>> properly > > >>>> >>>> decrease the time the tests take (which I am not > > sure we ever > > >>>> did) or > > >>>> >>>> postpone the problem by splitting into more > > profiles. (Note > > >>>> that the ASF > > >>>> >>>> Travis account has higher time limits) > > >>>> >>>> > > >>>> >>>> Best, > > >>>> >>>> > > >>>> >>>> Dawid > > >>>> >>>> > > >>>> >>>> On 26/06/2019 09:36, Robert Metzger wrote: > > >>>> >>>>> Do we know if using "the best" available hardware > > would > > >>>> improve the > > >>>> >> build > > >>>> >>>>> times? > > >>>> >>>>> Imagine we would run the build on machines with > > plenty of > > >>>> main memory > > >>>> >> to > > >>>> >>>>> mount everything to ramdisk + the latest CPU > > architecture? > > >>>> >>>>> > > >>>> >>>>> Throwing hardware at the problem could help reduce > > the time > > >>>> of an > > >>>> >>>>> individual build, and using our own infrastructure > > would > > >>>> remove our > > >>>> >>>>> dependency on Apache's Travis account (with the > > obvious > > >>>> downside of > > >>>> >>>> having > > >>>> >>>>> to maintain the infrastructure) > > >>>> >>>>> We could use an open source travis alternative, to > > have a > > >>>> similar > > >>>> >>>>> experience and make the migration easy. > > >>>> >>>>> > > >>>> >>>>> > > >>>> >>>>> On Wed, Jun 26, 2019 at 9:34 AM Chesnay Schepler > > >>>> <[hidden email] <mailto:[hidden email]> > > <mailto:[hidden email] <mailto:[hidden email]>>> > > >>>> >>>> wrote: > > >>>> >>>>>> >From what I gathered, there's no special > > sauce that the > > >>>> Zeppelin > > >>>> >>>>>> project uses which actually integrates a users > Travis > > >>>> account into the > > >>>> >>>> PR. > > >>>> >>>>>> They just disabled Travis for PRs. And that's > > kind of it. 
> > >>>> >>>>>> > > >>>> >>>>>> Naturally we can do this (duh) and safe the ASF a > > fair > > >>>> amount of > > >>>> >>>>>> resources, but there are downsides: > > >>>> >>>>>> > > >>>> >>>>>> The discoverability of the Travis check takes a > > nose-dive. > > >>>> Either we > > >>>> >>>>>> require every contributor to always, an every > > commit, also > > >>>> post a > > >>>> >> Travis > > >>>> >>>>>> build, or we have the reviewer sift through the > > >>>> contributors account > > >>>> >> to > > >>>> >>>>>> find it. > > >>>> >>>>>> > > >>>> >>>>>> This is rather cumbersome. Additionally, it's > > also not > > >>>> equivalent to > > >>>> >>>>>> having a PR build. > > >>>> >>>>>> > > >>>> >>>>>> A normal branch build takes a branch as is and > > tests it. A > > >>>> PR build > > >>>> >>>>>> merges the branch into master, and then runs it. > > (Fun fact: > > >>>> This is > > >>>> >> why > > >>>> >>>>>> a PR without merge conflicts is not being run on > > Travis.) > > >>>> >>>>>> > > >>>> >>>>>> And ultimately, everyone can already make use of > this > > >>>> approach anyway. > > >>>> >>>>>> > > >>>> >>>>>> On 25/06/2019 08:02, Jark Wu wrote: > > >>>> >>>>>>> Hi Jeff, > > >>>> >>>>>>> > > >>>> >>>>>>> Thanks for sharing the Zeppelin approach. I > > think it's a > > >>>> good idea to > > >>>> >>>>>>> leverage user's travis account. > > >>>> >>>>>>> In this way, we can have almost unlimited > > concurrent build > > >>>> jobs and > > >>>> >>>>>>> developers can restart build by themselves > > (currently only > > >>>> committers > > >>>> >>>>>>> can restart PR's build). > > >>>> >>>>>>> > > >>>> >>>>>>> But I'm still not very clear how to integrate > user's > > >>>> travis build > > >>>> >> into > > >>>> >>>>>>> the Flink pull request's build automatically. > > Can you > > >>>> explain more in > > >>>> >>>>>>> detail? > > >>>> >>>>>>> > > >>>> >>>>>>> Another question: does travis only build > > branches for user > > >>>> account? 
> > >>>> >>>>>>> My concern is that builds for PRs will rebase > user's > > >>>> commits against > > >>>> >>>>>>> current master branch. > > >>>> >>>>>>> This will help us to find problems before > > merge. Builds > > >>>> for branches > > >>>> >>>>>>> will lose the impact of new commits in master. > > >>>> >>>>>>> How does Zeppelin solve this problem? > > >>>> >>>>>>> > > >>>> >>>>>>> Thanks again for sharing the idea. > > >>>> >>>>>>> > > >>>> >>>>>>> Regards, > > >>>> >>>>>>> Jark > > >>>> >>>>>>> > > >>>> >>>>>>> On Tue, 25 Jun 2019 at 11:01, Jeff Zhang > > <[hidden email] <mailto:[hidden email]> > > >>>> <mailto:[hidden email] <mailto:[hidden email]>> > > >>>> >>>>>>> <mailto:[hidden email] > > <mailto:[hidden email]> <mailto:[hidden email] > > <mailto:[hidden email]>>>> wrote: > > >>>> >>>>>>> > > >>>> >>>>>>> Hi Folks, > > >>>> >>>>>>> > > >>>> >>>>>>> Zeppelin meet this kind of issue before, we solve > > >>>> it by > > >>>> >> delegating > > >>>> >>>>>>> each > > >>>> >>>>>>> one's PR build to his travis account > > (Everyone can > > >>>> have 5 free > > >>>> >>>>>>> slot for > > >>>> >>>>>>> travis build). > > >>>> >>>>>>> Apache account travis build is only triggered when > > >>>> PR is merged. 
> > >>>> >>>>>>> > > >>>> >>>>>>> > > >>>> >>>>>>> > > >>>> >>>>>>> Kurt Young <[hidden email] > > <mailto:[hidden email]> > > >>>> <mailto:[hidden email] <mailto:[hidden email]>> > > <mailto:[hidden email] <mailto:[hidden email]> > > >>>> <mailto:[hidden email] <mailto:[hidden email]>>>> > > >>>> >>>>>>> 于2019年6月25日周二 上午10:16写道: > > >>>> >>>>>>> > > >>>> >>>>>>> > (Forgot to cc George) > > >>>> >>>>>>> > > > >>>> >>>>>>> > Best, > > >>>> >>>>>>> > Kurt > > >>>> >>>>>>> > > > >>>> >>>>>>> > > > >>>> >>>>>>> > On Tue, Jun 25, 2019 at 10:16 AM Kurt Young > > >>>> <[hidden email] <mailto:[hidden email]> > > <mailto:[hidden email] <mailto:[hidden email]>> > > >>>> >>>>>>> <mailto:[hidden email] > > <mailto:[hidden email]> <mailto:[hidden email] > > <mailto:[hidden email]>>>> > > >>>> wrote: > > >>>> >>>>>>> > > > >>>> >>>>>>> > > Hi Bowen, > > >>>> >>>>>>> > > > > >>>> >>>>>>> > > Thanks for bringing this up. We > > actually have > > >>>> discussed > > >>>> >> about > > >>>> >>>>>>> this, and I > > >>>> >>>>>>> > > think Till and George have > > >>>> >>>>>>> > > already spend sometime investigating > > it. I have > > >>>> cced both of > > >>>> >>>>>>> them, and > > >>>> >>>>>>> > > maybe they can share > > >>>> >>>>>>> > > their findings. > > >>>> >>>>>>> > > > > >>>> >>>>>>> > > Best, > > >>>> >>>>>>> > > Kurt > > >>>> >>>>>>> > > > > >>>> >>>>>>> > > > > >>>> >>>>>>> > > On Tue, Jun 25, 2019 at 10:08 AM Jark Wu > > >>>> <[hidden email] <mailto:[hidden email]> > > <mailto:[hidden email] <mailto:[hidden email]>> > > >>>> >>>>>>> <mailto:[hidden email] > > <mailto:[hidden email]> <mailto:[hidden email] > > <mailto:[hidden email]>>>> > > >>>> wrote: > > >>>> >>>>>>> > > > > >>>> >>>>>>> > >> Hi Bowen, > > >>>> >>>>>>> > >> > > >>>> >>>>>>> > >> Thanks for bringing this. We also > > suffered from > > >>>> the long > > >>>> >>>>>>> build time. 
> > >>>> >>>>>>> > >> I agree that we should focus on > > solving build > > >>>> capacity > > >>>> >>>>>>> problem in the > > >>>> >>>>>>> > >> thread. > > >>>> >>>>>>> > >> > > >>>> >>>>>>> > >> My observation is there is only one > > build is > > >>>> running, all > > >>>> >> the > > >>>> >>>>>>> others > > >>>> >>>>>>> > >> (other > > >>>> >>>>>>> > >> PRs, master) are pending. > > >>>> >>>>>>> > >> The pricing plan[1] of travis shows > > it can > > >>>> support > > >>>> >> concurrent > > >>>> >>>>>>> build > > >>>> >>>>>>> > jobs. > > >>>> >>>>>>> > >> But I don't know which plan we are > > using, might > > >>>> be the free > > >>>> >>>>>>> plan for > > >>>> >>>>>>> > open > > >>>> >>>>>>> > >> source. > > >>>> >>>>>>> > >> > > >>>> >>>>>>> > >> I cc-ed Chesnay who may have some > > experience on > > >>>> Travis. > > >>>> >>>>>>> > >> > > >>>> >>>>>>> > >> Regards, > > >>>> >>>>>>> > >> Jark > > >>>> >>>>>>> > >> > > >>>> >>>>>>> > >> [1]: https://travis-ci.com/plans > > >>>> >>>>>>> > >> > > >>>> >>>>>>> > >> On Tue, 25 Jun 2019 at 08:11, Bowen Li < > > >>>> >> [hidden email] <mailto:[hidden email]> > > <mailto:[hidden email] <mailto:[hidden email]>> > > >>>> >>>>>>> <mailto:[hidden email] > > <mailto:[hidden email]> > > >>>> <mailto:[hidden email] > > <mailto:[hidden email]>>>> wrote: > > >>>> >>>>>>> > >> > > >>>> >>>>>>> > >> > Hi Steven, > > >>>> >>>>>>> > >> > > > >>>> >>>>>>> > >> > I think you may not read what I > > wrote. The > > >>>> discussion is > > >>>> >>>> about > > >>>> >>>>>>> > "unstable > > >>>> >>>>>>> > >> > build **capacity**", in another word > > >>>> "unstable / lack of > > >>>> >>>> build > > >>>> >>>>>>> > >> resources", > > >>>> >>>>>>> > >> > not "unstable build". 
> > >>>> >>>>>>> > >> > > > >>>> >>>>>>> > >> > On Mon, Jun 24, 2019 at 4:40 PM > > Steven Wu > > >>>> >>>>>>> <[hidden email] > > <mailto:[hidden email]> <mailto:[hidden email] > > <mailto:[hidden email]>> > > >>>> <mailto:[hidden email] > > <mailto:[hidden email]> <mailto:[hidden email] > > <mailto:[hidden email]>>>> > > >>>> >>>>>>> > wrote: > > >>>> >>>>>>> > >> > > > >>>> >>>>>>> > >> > > long and sometimes unstable build is > > >>>> definitely a pain > > >>>> >>>>>> point. > > >>>> >>>>>>> > >> > > > > >>>> >>>>>>> > >> > > I suspect the build failure here in > > >>>> >> flink-connector-kafka > > >>>> >>>>>>> is not > > >>>> >>>>>>> > >> related > > >>>> >>>>>>> > >> > to > > >>>> >>>>>>> > >> > > my change. but there is no easy > > re-run the > > >>>> build on > > >>>> >>>>>>> travis UI. > > >>>> >>>>>>> > >> > > search showed a trick of > > close-and-open the > > >>>> PR will > > >>>> >>>>>>> trigger rebuild. > > >>>> >>>>>>> > >> but > > >>>> >>>>>>> > >> > > that could add noises to the PR > > activities. > > >>>> >>>>>>> > >> > > > > >>>> https://travis-ci.org/apache/flink/jobs/545555519 > > >>>> >>>>>>> > >> > > > > >>>> >>>>>>> > >> > > travis-ci for my personal repo > > often failed > > >>>> with > > >>>> >>>>>>> exceeding time > > >>>> >>>>>>> > limit > > >>>> >>>>>>> > >> > after > > >>>> >>>>>>> > >> > > 4+ hours. > > >>>> >>>>>>> > >> > > The job exceeded the maximum time > > limit for > > >>>> jobs, and > > >>>> >> has > > >>>> >>>>>>> been > > >>>> >>>>>>> > >> > terminated. 
> > >>>> >>>>>>> > >> > > > > >>>> >>>>>>> > >> > > On Mon, Jun 24, 2019 at 4:15 PM > > Bowen Li > > >>>> >>>>>>> <[hidden email] > > <mailto:[hidden email]> <mailto:[hidden email] > > <mailto:[hidden email]>> > > >>>> <mailto:[hidden email] <mailto:[hidden email]> > > <mailto:[hidden email] <mailto:[hidden email]>>>> > > >>>> >>>>>>> > wrote: > > >>>> >>>>>>> > >> > > > > >>>> >>>>>>> > >> > > > > > >>>> https://travis-ci.org/apache/flink/builds/549681530 > > >>>> >>>>>>> This build > > >>>> >>>>>>> > >> > request > > >>>> >>>>>>> > >> > > > has > > >>>> >>>>>>> > >> > > > been sitting at **HEAD of the > > queue** > > >>>> since I first > > >>>> >> saw > > >>>> >>>>>>> it at PST > > >>>> >>>>>>> > >> > 10:30am > > >>>> >>>>>>> > >> > > > (not sure how long it's been > > there before > > >>>> 10:30am). > > >>>> >>>>>>> It's PST > > >>>> >>>>>>> > 4:12pm > > >>>> >>>>>>> > >> now > > >>>> >>>>>>> > >> > > and > > >>>> >>>>>>> > >> > > > it hasn't started yet. > > >>>> >>>>>>> > >> > > > > > >>>> >>>>>>> > >> > > > On Mon, Jun 24, 2019 at 2:48 PM > > Bowen Li > > >>>> >>>>>>> <[hidden email] > > <mailto:[hidden email]> <mailto:[hidden email] > > <mailto:[hidden email]>> > > >>>> <mailto:[hidden email] <mailto:[hidden email]> > > <mailto:[hidden email] <mailto:[hidden email]>>>> > > >>>> >>>>>>> > >> wrote: > > >>>> >>>>>>> > >> > > > > > >>>> >>>>>>> > >> > > > > Hi devs, > > >>>> >>>>>>> > >> > > > > > > >>>> >>>>>>> > >> > > > > I've been experiencing the pain > > >>>> resulting from lack > > >>>> >>>>>>> of stable > > >>>> >>>>>>> > >> build > > >>>> >>>>>>> > >> > > > > capacity on Travis for Flink > > PRs [1]. 
> > >>>> >> Specifically, I > > >>>> >>>>>>> noticed > > >>>> >>>>>>> > >> often > > >>>> >>>>>>> > >> > > that > > >>>> >>>>>>> > >> > > > no > > >>>> >>>>>>> > >> > > > > build in the queue is making any > > >>>> progress for > > >>>> >> hours, > > >>>> >>>> and > > >>>> >>>>>>> > suddenly > > >>>> >>>>>>> > >> 5 > > >>>> >>>>>>> > >> > or > > >>>> >>>>>>> > >> > > 6 > > >>>> >>>>>>> > >> > > > > builds kick off all together > > after the > > >>>> long pause. > > >>>> >>>>>>> I'm at PST > > >>>> >>>>>>> > >> > (UTC-08) > > >>>> >>>>>>> > >> > > > time > > >>>> >>>>>>> > >> > > > > zone, and I've seen pause can > > be as > > >>>> long as 6 hours > > >>>> >>>>>>> from PST 9am > > >>>> >>>>>>> > >> to > > >>>> >>>>>>> > >> > 3pm > > >>>> >>>>>>> > >> > > > > (let alone the time needed to > > drain the > > >>>> queue > > >>>> >>>>>>> afterwards). > > >>>> >>>>>>> > >> > > > > > > >>>> >>>>>>> > >> > > > > I think this has greatly > > impacted our > > >>>> productivity. > > >>>> >>>> I've > > >>>> >>>>>>> > >> experienced > > >>>> >>>>>>> > >> > > that > > >>>> >>>>>>> > >> > > > > PRs submitted in the early > > morning of > > >>>> PST time zone > > >>>> >>>>>>> won't finish > > >>>> >>>>>>> > >> > their > > >>>> >>>>>>> > >> > > > > build until late night of the > > same day. > > >>>> >>>>>>> > >> > > > > > > >>>> >>>>>>> > >> > > > > So my questions are: > > >>>> >>>>>>> > >> > > > > > > >>>> >>>>>>> > >> > > > > - Has anyone else experienced > > the same > > >>>> problem or > > >>>> >>>>>>> have similar > > >>>> >>>>>>> > >> > > > observation > > >>>> >>>>>>> > >> > > > > on TravisCI? (I suspect it > > has things > > >>>> to do with > > >>>> >> time > > >>>> >>>>>>> zone) > > >>>> >>>>>>> > >> > > > > > > >>>> >>>>>>> > >> > > > > - What pricing plan of > > TravisCI is > > >>>> Flink currently > > >>>> >>>>>>> using? Is it > > >>>> >>>>>>> > >> the > > >>>> >>>>>>> > >> > > free > > >>>> >>>>>>> > >> > > > > plan for open source > > projects? 
What > > >>>> are the > > >>>> >>>>>>> guaranteed build > > >>>> >>>>>>> > >> capacity > > >>>> >>>>>>> > >> > > of > > >>>> >>>>>>> > >> > > > > the current plan? > > >>>> >>>>>>> > >> > > > > > > >>>> >>>>>>> > >> > > > > - If the current pricing plan > > (either > > >>>> free or paid) > > >>>> >>>>>> can't > > >>>> >>>>>>> > provide > > >>>> >>>>>>> > >> > > stable > > >>>> >>>>>>> > >> > > > > build capacity, can we > > upgrade to a > > >>>> higher priced > > >>>> >>>>>>> plan with > > >>>> >>>>>>> > larger > > >>>> >>>>>>> > >> > and > > >>>> >>>>>>> > >> > > > more > > >>>> >>>>>>> > >> > > > > stable build capacity? > > >>>> >>>>>>> > >> > > > > > > >>>> >>>>>>> > >> > > > > BTW, another factor that > > contribute to > > >>>> the > > >>>> >>>>>>> productivity problem > > >>>> >>>>>>> > is > > >>>> >>>>>>> > >> > that > > >>>> >>>>>>> > >> > > > > our build is slow - we run > > full build > > >>>> for every PR > > >>>> >>>> and a > > >>>> >>>>>>> > >> successful > > >>>> >>>>>>> > >> > > full > > >>>> >>>>>>> > >> > > > > build takes ~5h. We > > definitely have > > >>>> more options to > > >>>> >>>>>>> solve it, > > >>>> >>>>>>> > for > > >>>> >>>>>>> > >> > > > instance, > > >>>> >>>>>>> > >> > > > > modularize the build graphs > > and reuse > > >>>> artifacts > > >>>> >> from > > >>>> >>>> the > > >>>> >>>>>>> > previous > > >>>> >>>>>>> > >> > > build. > > >>>> >>>>>>> > >> > > > > But I think that can be a big > > effort > > >>>> which is much > > >>>> >>>>>>> harder to > > >>>> >>>>>>> > >> > accomplish > > >>>> >>>>>>> > >> > > > in > > >>>> >>>>>>> > >> > > > > a short period of time and > > may deserve > > >>>> its own > > >>>> >>>> separate > > >>>> >>>>>>> > >> discussion. 
> > >>>> >>>>>>> > >> > > > > > > >>>> >>>>>>> > >> > > > > [1] > > >>>> >> https://travis-ci.org/apache/flink/pull_requests > > >>>> >>>>>>> > >> > > > > > > >>>> >>>>>>> > >> > > > > > > >>>> >>>>>>> > >> > > > > > >>>> >>>>>>> > >> > > > > >>>> >>>>>>> > >> > > > >>>> >>>>>>> > >> > > >>>> >>>>>>> > > > > >>>> >>>>>>> > > > >>>> >>>>>>> > > >>>> >>>>>>> > > >>>> >>>>>>> -- > > >>>> >>>>>>> Best Regards > > >>>> >>>>>>> > > >>>> >>>>>>> Jeff Zhang > > >>>> >>>>>>> > > >>>> >> > > >>>> > > >>> > > >> > > > > |
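The concern raised in this thread, that a Jenkins worker sits idle for as long as the Travis build runs, can be illustrated with a sketch of the polling loop. This is not the actual travis_check.py: the function name, the state strings, the 4-hour default timeout and the timeout exit code are all hypothetical. Only exit code 2 ("no build information found") mirrors the Jenkins job quoted in the thread.

```python
import time
from typing import Callable, Optional

# Hypothetical state names; a real CI API may report different ones.
FINISHED = {"passed": 0, "failed": 1}
NOT_FOUND_EXIT = 2  # the quoted Jenkins job treats 2 as "no build information found"
TIMEOUT_EXIT = 3    # assumption: separate code so a timeout is not read as a test failure


def wait_for_build(
    fetch_state: Callable[[], Optional[str]],
    poll_interval: float = 60.0,
    timeout: float = 4 * 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Poll fetch_state() until the build finishes, is missing, or times out.

    fetch_state returns the current build state, or None if no build exists.
    The worker running this loop is occupied for the whole wait.
    """
    deadline = clock() + timeout
    while True:
        state = fetch_state()
        if state is None:
            return NOT_FOUND_EXIT
        if state in FINISHED:
            return FINISHED[state]
        if clock() >= deadline:
            return TIMEOUT_EXIT
        sleep(poll_interval)


if __name__ == "__main__":
    # Simulated build: no network access, no real waiting.
    states = iter(["created", "started", "started", "passed"])
    print(wait_for_build(lambda: next(states), sleep=lambda _s: None))
```

Passing `sleep` and `clock` as parameters keeps the loop testable without waiting; in a real job the defaults would apply and the worker would be blocked for the full build duration.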
+1

Aljoscha

> On 4. Jul 2019, at 11:09, Stephan Ewen <[hidden email]> wrote:
>
> +1 to move to a private Travis account.
>
> I can confirm that Ververica will sponsor a Travis CI plan that is
> equivalent to or a bit higher than the previous ASF quota (10 concurrent
> build queues).
>
> Best,
> Stephan
>
> On Thu, Jul 4, 2019 at 10:46 AM Chesnay Schepler <[hidden email]> wrote:
>
>> I've raised a JIRA <https://issues.apache.org/jira/browse/INFRA-18703>
>> with INFRA to inquire whether it would be possible to switch to a
>> different Travis account, and if so what steps would need to be taken.
>> We need a proper confirmation from INFRA since we are not in full
>> control of the flink repository (for example, we cannot access the
>> settings page).
>>
>> If this is indeed possible, Ververica is willing to sponsor a Travis
>> account for the Flink project.
>> This would provide us with more than enough resources.
>>
>> Since this makes the project more reliant on resources provided by
>> external companies I would like to vote on this.
>>
>> Please vote on this proposal, as follows:
>> [ ] +1, Approve the migration to a Ververica-sponsored Travis account,
>> provided that INFRA approves
>> [ ] -1, Do not approve the migration to a Ververica-sponsored Travis
>> account
>>
>> The vote will be open for at least 24h, and until we have confirmation
>> from INFRA. The voting period may be shorter than the usual 3 days since
>> our current setup is effectively not working.
>>
>> On 04/07/2019 06:51, Bowen Li wrote:
>>> Re: > Are they using their own Travis CI pool, or did they switch to
>>> an entirely different CI service?
>>>
>>> I reached out to Wes and Krisztián from Apache Arrow PMC. They are
>>> currently moving away from ASF's Travis to their own in-house metal
>>> machines at [1] with custom CI application at [2].
>>> They've seen significant improvement w.r.t. both much higher
>>> performance and basically no resource waiting time, a "night-and-day"
>>> difference quoting Wes.
>>>
>>> Re: > If we can just switch to our own Travis pool, just for our
>>> project, then this might be something we can do fairly quickly?
>>>
>>> I believe so, according to [3] and [4]
>>>
>>> [1] https://ci.ursalabs.org/
>>> [2] https://github.com/ursa-labs/ursabot
>>> [3] https://docs.travis-ci.com/user/migrate/open-source-repository-migration
>>> [4] https://docs.travis-ci.com/user/migrate/open-source-on-travis-ci-com
>>>
>>> On Wed, Jul 3, 2019 at 12:01 AM Chesnay Schepler <[hidden email]> wrote:
>>>
>>>     Are they using their own Travis CI pool, or did they switch to an
>>>     entirely different CI service?
>>>
>>>     If we can just switch to our own Travis pool, just for our
>>>     project, then this might be something we can do fairly quickly?
>>>
>>> On 03/07/2019 05:55, Bowen Li wrote:
>>>> I responded in the INFRA ticket [1] that I believe they are using a
>>>> wrong metric against Flink and the total build time is a completely
>>>> different thing than guaranteed build capacity.
>>>>
>>>> My response:
>>>>
>>>> "As mentioned above, since I started to pay attention to Flink's build
>>>> queue a few tens of days ago, I'm in Seattle and I saw no build was
>>>> kicking off in PST daytime in weekdays for Flink. Our teammates in
>>>> China and Europe have also reported similar observations. So we need
>>>> to evaluate where the large total build time came from - if 1) your
>>>> number and 2) our observations from three locations that cover pretty
>>>> much a full day, are all true, I **guess** one reason can be that -
>>>> highly likely the extra build time came from weekends when other
>>>> Apache projects may be idle and Flink just drains hard its congested
>>>> queue.
>>>>
>>>> Please be aware that we're not complaining about the lack of
>>>> resources in general, I'm complaining about the lack of **stable,
>>>> dedicated** resources. An example for the latter one is, currently even
>>>> if no build is in Flink's queue and I submit a request to be the queue
>>>> head in PST morning, my build won't even start in 6-8+h. That is an
>>>> absurd amount of waiting time.
>>>>
>>>> That is to say, if ASF INFRA decides to adopt a quota system and grants
>>>> Flink five DEDICATED servers that run all the time only for Flink,
>>>> that'll be PERFECT and can totally solve our problem now.
>>>>
>>>> I feel what's missing in the ASF INFRA's Travis resource pool is some
>>>> level of build capacity SLAs and certainty"
>>>>
>>>> Again, I believe there are differences in nature of these two problems,
>>>> long build time vs. lack of dedicated build resources. That is to say,
>>>> shortening build time may relieve the situation, and may not. I'm
>>>> slightly negative on disabling IT cases for PRs, since the downside is
>>>> that we are at risk of any potential bugs in PRs that UTs don't catch,
>>>> which may cost a lot more to fix and may slow others down or even block
>>>> them, but am open to others' opinions on it.
>>>>
>>>> AFAICT from INFRA ticket [1], donating to ASF INFRA won't be feasible
>>>> to solve our problem since INFRA's pool is fully shared and they have
>>>> no control and finer insights over resource allocation to a specific
>>>> Apache project. As mentioned in [1], Apache Arrow is moving away from
>>>> ASF INFRA Travis pool (they are actually surprised Flink hasn't planned
>>>> to do so). I know that Spark is on its own build infra. If we all agree
>>>> on funding our own build infra, I'd be glad to help investigate any
>>>> potential options after releasing 1.9 since I'm super busy with 1.9 now.
>>>>
>>>> [1] https://issues.apache.org/jira/browse/INFRA-18533
>>>>
>>>> On Tue, Jul 2, 2019 at 4:46 AM Chesnay Schepler <[hidden email]> wrote:
>>>>
>>>>> As a short-term stopgap, since we can assume this issue to become much
>>>>> worse in the following days/weeks, we could disable IT cases in PRs
>>>>> and only run them on master.
>>>>>
>>>>> On 02/07/2019 12:03, Chesnay Schepler wrote:
>>>>>> People really have to stop thinking that just because something works
>>>>>> for us it is also a good solution.
>>>>>> Also, please remember that our builds run for 2h from start to finish,
>>>>>> and not the 14 _minutes_ it takes for zeppelin.
>>>>>> We are dealing with an entirely different scale here, both in terms of
>>>>>> build times and number of builds.
>>>>>>
>>>>>> In this very thread people have been complaining about long queue
>>>>>> times for their builds. Surprise, other Apache projects have been
>>>>>> suffering the very same thing due to us not controlling our build
>>>>>> times.
>>>>>> While switching services (be it Jenkins, CircleCI or whatever)
>>>>>> will possibly work for us (and these options are actually attractive,
>>>>>> like CircleCI's proper support for build artifacts), it will also
>>>>>> likely result in us negatively affecting other projects in significant
>>>>>> ways.
>>>>>>
>>>>>> Sure, the Jenkins setup has a good user experience for us, at the cost
>>>>>> of blocking Jenkins workers for a _lot_ of time. Right now we have 25
>>>>>> PRs in our queue; that's possibly 50h we'd consume of Jenkins
>>>>>> resources, and the European contributors haven't even really started
>>>>>> yet.
>>>>>>
>>>>>> FYI, the latest INFRA response from INFRA-18533:
>>>>>>
>>>>>> "Our rough metrics shows that Flink used over 5800 hours of build time
>>>>>> last month. That is equal to EIGHT servers running 24/7 for the ENTIRE
>>>>>> MONTH. EIGHT. nonstop.
>>>>>> When we discovered this last night, we discussed it some and are going
>>>>>> to tune down Flink to allow only five executors maximum. We cannot
>>>>>> allow Flink to consume so much of a Foundation shared resource."
>>>>>>
>>>>>> So yes, we either
>>>>>> a) have to heavily reduce our CI usage or
>>>>>> b) fund our own, either maintaining it ourselves or donating to Apache.
>>>>>>
>>>>>> On 02/07/2019 05:11, Bowen Li wrote:
>>>>>>> By looking at the git history of the Jenkins script, its core part
>>>>>>> was finished in March 2017 (with only two minor updates in 2017/2018),
>>>>>>> so it's been running for over two years now and it feels like the
>>>>>>> Zeppelin community has been quite happy with it. @Jeff Zhang
>>>>>>> <[hidden email]> can you share your insights and user experience with
>>>>>>> the Jenkins+Travis approach?
>>>>>>>
>>>>>>> Things like:
>>>>>>>
>>>>>>> - has the approach completely solved the resource capacity problem
>>>>>>> for the Zeppelin community? is the Zeppelin community happy with the
>>>>>>> result?
>>>>>>> - is the whole configuration chain stable (e.g. uptime) enough?
>>>>>>> - how often do you need to maintain the Jenkins infra? how many
>>>>>>> people are usually involved in maintenance and bug-fixes?
>>>>>>>
>>>>>>> The downside of this approach seems mostly to be on the maintenance
>>>>>>> to me - maintain the script and Jenkins infra.
>>>>>>>
>>>>>>> ** Having Our Own Travis-CI.com Account **
>>>>>>>
>>>>>>> Another alternative I've been thinking of is to have our own
>>>>>>> travis-ci.com account with paid dedicated resources. Note
>>>>>>> travis-ci.org is the free version and travis-ci.com is the commercial
>>>>>>> version. We currently use a shared resource pool managed by the ASF
>>>>>>> INFRA team on travis-ci.org, but we have no control over it - we
>>>>>>> can't see how it's configured, how many resources are available, how
>>>>>>> resources are allocated among Apache projects, etc.
>>>>>>>
>>>>>>> The nice things about having an account on travis-ci.com are:
>>>>>>>
>>>>>>> - relatively low cost with much better resource guarantee than what
>>>>>>> we currently have [1]: $249/month with 5 dedicated concurrency,
>>>>>>> $489/month with 10 concurrency
>>>>>>> - low maintenance work compared to using Jenkins
>>>>>>> - (potentially) no migration cost according to Travis's doc [2]
>>>>>>> (pending verification)
>>>>>>> - full control over the build capacity/configuration compared to
>>>>>>> using ASF INFRA's pool
>>>>>>>
>>>>>>> I'd be surprised if we as such a vibrant community cannot find and
>>>>>>> fund $249*12=$2988 a year in exchange for a much better developer
>>>>>>> experience and much higher productivity.
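The figures quoted in this thread can be cross-checked with a few lines of arithmetic. This is only a sketch: the 30-day month and the function names are assumptions made for illustration, while the build hours, queue size, build duration and plan prices are the ones stated in the messages.

```python
# Back-of-the-envelope check of the numbers quoted in this thread.
# Assumption (not from the thread): a month is taken as 30 days.

HOURS_PER_MONTH = 24 * 30  # one server running 24/7 for 30 days


def servers_equivalent(build_hours: float) -> float:
    """Number of always-on servers that a monthly build-hour total equals."""
    return build_hours / HOURS_PER_MONTH


def queue_hours(pending_builds: int, hours_per_build: float) -> float:
    """Worker time needed to drain a queue if builds run one after another."""
    return pending_builds * hours_per_build


def yearly_cost(monthly_price: int) -> int:
    """Yearly price of a plan billed monthly."""
    return monthly_price * 12


if __name__ == "__main__":
    # INFRA: "over 5800 hours of build time last month"
    print(f"servers: {servers_equivalent(5800):.2f}")
    # 25 queued PRs at roughly 2h per build
    print(f"queue hours: {queue_hours(25, 2)}")
    # travis-ci.com plans mentioned in the thread
    print(f"5 concurrent builds: ${yearly_cost(249)} per year")
    print(f"10 concurrent builds: ${yearly_cost(489)} per year")
```

Under the 30-day assumption, 5800 hours comes out at roughly eight always-on servers, which matches INFRA's statement, and the 5-concurrency plan totals the $2988 per year given above.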
>>>>>>>
>>>>>>> [1] https://travis-ci.com/plans
>>>>>>> [2] https://docs.travis-ci.com/user/migrate/open-source-repository-migration
>>>>>>>
>>>>>>> On Sat, Jun 29, 2019 at 8:39 AM Chesnay Schepler <[hidden email]> wrote:
>>>>>>>
>>>>>>>     So yes, the Jenkins job keeps pulling the state from Travis until it finishes.
>>>>>>>
>>>>>>>     Not sure I'm comfortable with the idea of using Jenkins workers just to idle for several hours.
>>>>>>>
>>>>>>>     On 29/06/2019 14:56, Jeff Zhang wrote:
>>>>>>>> Here's what zeppelin community did, we make a python script to check the build status of pull request.
>>>>>>>> Here's script:
>>>>>>>> https://github.com/apache/zeppelin/blob/master/travis_check.py
>>>>>>>>
>>>>>>>> And this is the script we used in Jenkins build job.
>>>>>>>>
>>>>>>>> if [ -f "travis_check.py" ]; then
>>>>>>>>   git log -n 1
>>>>>>>>   STATUS=$(curl -s $BUILD_URL | grep -e "GitHub pull request.*from.*" | sed 's/.*GitHub pull request <a href=\"\(https[^"]*\).*from[^"]*.\(https[^"]*\).*/\1 \2/g')
>>>>>>>>   AUTHOR=$(echo $STATUS | sed 's/.*[/]\(.*\)$/\1/g')
>>>>>>>>   PR=$(echo $STATUS | awk '{print $1}' | sed 's/.*[/]\(.*\)$/\1/g')
>>>>>>>>   #COMMIT=$(git log -n 1 | grep "^Merge:" | awk '{print $3}')
>>>>>>>>   #if [ -z $COMMIT ]; then
>>>>>>>>   #  COMMIT=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR | grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | tr '\n' ' ' | sed 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' | grep -v "apache:" | sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g')
>>>>>>>>   #fi
>>>>>>>>
>>>>>>>>   # get commit hash from PR
>>>>>>>>   COMMIT=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR | grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | tr '\n' ' ' | sed 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' | grep -v "apache:" | sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g')
>>>>>>>>   sleep 30 # sleep few moment to wait travis starts the build
>>>>>>>>   RET_CODE=0
>>>>>>>>   python ./travis_check.py ${AUTHOR} ${COMMIT} || RET_CODE=$?
>>>>>>>>   if [ $RET_CODE -eq 2 ]; then # try with repository name when travis-ci is not available in the account
>>>>>>>>     RET_CODE=0
>>>>>>>>     AUTHOR=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR | grep '"full_name":' | grep -v "apache/zeppelin" | sed 's/.*[:][^"]*["]\([^/]*\).*/\1/g')
>>>>>>>>     python ./travis_check.py ${AUTHOR} ${COMMIT} || RET_CODE=$?
>>>>>>>>   fi
>>>>>>>>
>>>>>>>>   if [ $RET_CODE -eq 2 ]; then # fail with can't find build information in the travis
>>>>>>>>     set +x
>>>>>>>>     echo "-----------------------------------------------------"
>>>>>>>>     echo "Looks like travis-ci is not configured for your fork."
>>>>>>>>     echo "Please setup by switch on 'zeppelin' repository at https://travis-ci.org/profile and travis-ci."
>>>>>>>>     echo "And then make sure 'Build branch updates' option is enabled in the settings https://travis-ci.org/${AUTHOR}/zeppelin/settings."
>>>>>>>>     echo ""
>>>>>>>>     echo "To trigger CI after setup, you will need amend your last commit with"
>>>>>>>>     echo "git commit --amend"
>>>>>>>>     echo "git push your-remote HEAD --force"
>>>>>>>>     echo ""
>>>>>>>>     echo "See http://zeppelin.apache.org/contribution/contributions.html#continuous-integration."
>>>>>>>>   fi
>>>>>>>>
>>>>>>>>   exit $RET_CODE
>>>>>>>> else
>>>>>>>>   set +x
>>>>>>>>   echo "travis_check.py does not exist"
>>>>>>>>   exit 1
>>>>>>>> fi
>>>>>>>>
>>>>>>>> Chesnay Schepler <[hidden email]> wrote on Sat, Jun 29, 2019 at 3:17 PM:
>>>>>>>>
>>>>>>>>> Does this imply that a Jenkins job is active as long as the Travis build runs?
>>>>>>>>>
>>>>>>>>> On 26/06/2019 21:28, Bowen Li wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> @Dawid, I think the "long test running" as I mentioned in the first email, also as you guys said, belongs to "a big effort which is much harder to accomplish in a short period of time and may deserve its own separate discussion". Thus I didn't include it in what we can do in a foreseeable short term.
>>>>>>>>>>
>>>>>>>>>> Besides, I don't think that's the ultimate reason for lack of build resources. Even if the build is shortened to something like 2h, the problem I described of no build machine working for 6 or more hours in PST daytime will still happen, because no machine from ASF INFRA's pool is allocated to Flink. As I have paid close attention to the build queue in the past few weekdays, it's a pretty clear pattern now.
>>>>>>>>>>
>>>>>>>>>> **The ultimate root cause** for that is - we don't have any **dedicated** build resources that we can stably rely on. I'm actually ok to wait for a long time if there are build requests running, it means at least we are making progress. But I'm not ok with no build resource.
>>>>>>>>>> A better place I think we should aim at in short term is to always have at least a central pool (can be 3 or 5) of machines dedicated to build Flink at any time, or maybe use users' resources.
>>>>>>>>>>
>>>>>>>>>> @Chesnay @Robert I synced with Jeff offline that Zeppelin community is using a Jenkins job to automatically build on users' travis account and link the result back to github PR. I guess the Jenkins job would fetch latest upstream master and build the PR against it. Jeff has filed tickets to learn and get access to the Jenkins infra. It'll be better to fully understand it first before judging this approach.
>>>>>>>>>>
>>>>>>>>>> I also heard good things about CircleCI, and ASF INFRA seems to have a pool of build capacity there too. Can be an alternative to consider.
>>>>>>>>>>
>>>>>>>>>> On Wed, Jun 26, 2019 at 12:44 AM Dawid Wysakowicz <[hidden email]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Sorry to jump in late, but I think Bowen missed the most important point from Chesnay's previous message in the summary. The ultimate reason for all the problems is that the tests take close to 2 hours to run already. I fully support this claim: "Unless people start caring about test times before adding them, this issue cannot be solved"
>>>>>>>>>>>
>>>>>>>>>>> This is also another reason why using user's Travis account won't help. Every few weeks we reach the user's time limit for a single profile. This makes the user's builds simply fail, until we either properly decrease the time the tests take (which I am not sure we ever did) or postpone the problem by splitting into more profiles. (Note that the ASF Travis account has higher time limits)
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>>
>>>>>>>>>>> Dawid
>>>>>>>>>>>
>>>>>>>>>>> On 26/06/2019 09:36, Robert Metzger wrote:
>>>>>>>>>>>> Do we know if using "the best" available hardware would improve the build times?
>>>>>>>>>>>> Imagine we would run the build on machines with plenty of main memory to mount everything to ramdisk + the latest CPU architecture?
>>>>>>>>>>>>
>>>>>>>>>>>> Throwing hardware at the problem could help reduce the time of an individual build, and using our own infrastructure would remove our dependency on Apache's Travis account (with the obvious downside of having to maintain the infrastructure)
>>>>>>>>>>>> We could use an open source travis alternative, to have a similar experience and make the migration easy.
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jun 26, 2019 at 9:34 AM Chesnay Schepler <[hidden email]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> From what I gathered, there's no special sauce that the Zeppelin project uses which actually integrates a user's Travis account into the PR.
>>>>>>>>>>>>>
>>>>>>>>>>>>> They just disabled Travis for PRs. And that's kind of it.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Naturally we can do this (duh) and save the ASF a fair amount of resources, but there are downsides:
>>>>>>>>>>>>>
>>>>>>>>>>>>> The discoverability of the Travis check takes a nose-dive. Either we require every contributor to always, on every commit, also post a Travis build, or we have the reviewer sift through the contributor's account to find it.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is rather cumbersome. Additionally, it's also not equivalent to having a PR build.
>>>>>>>>>>>>>
>>>>>>>>>>>>> A normal branch build takes a branch as is and tests it. A PR build merges the branch into master, and then runs it. (Fun fact: This is why a PR with merge conflicts is not being run on Travis.)
>>>>>>>>>>>>>
>>>>>>>>>>>>> And ultimately, everyone can already make use of this approach anyway.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 25/06/2019 08:02, Jark Wu wrote:
>>>>>>>>>>>>>> Hi Jeff,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for sharing the Zeppelin approach. I think it's a good idea to leverage user's travis account.
>>>>>>>>>>>>>> In this way, we can have almost unlimited concurrent build jobs and developers can restart build by themselves (currently only committers can restart PR's build).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> But I'm still not very clear how to integrate user's travis build into the Flink pull request's build automatically. Can you explain more in detail?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Another question: does travis only build branches for user account?
>>>>>>>>>>>>>> My concern is that builds for PRs will rebase user's commits against current master branch.
>>>>>>>>>>>>>> This will help us to find problems before merge. Builds for branches will lose the impact of new commits in master.
>>>>>>>>>>>>>> How does Zeppelin solve this problem?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks again for sharing the idea.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Jark
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, 25 Jun 2019 at 11:01, Jeff Zhang <[hidden email]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>     Hi Folks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>     Zeppelin met this kind of issue before, we solved it by delegating each one's PR build to his travis account (Everyone can have 5 free slots for travis build).
>>>>>>>>>>>>>>     Apache account travis build is only triggered when PR is merged.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>     Kurt Young <[hidden email]> wrote on Tue, Jun 25, 2019 at 10:16 AM:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> (Forgot to cc George)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Kurt
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Jun 25, 2019 at 10:16 AM Kurt Young <[hidden email]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Bowen,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for bringing this up. We actually have discussed this, and I think Till and George have already spent some time investigating it. I have cced both of them, and maybe they can share their findings.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>> Kurt
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Jun 25, 2019 at 10:08 AM Jark Wu <[hidden email]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Bowen,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks for bringing this. We also suffered from the long build time.
>>>>>>>>>>>>>>>>> I agree that we should focus on solving build capacity problem in the thread.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> My observation is there is only one build is running, all the others (other PRs, master) are pending.
>>>>>>>>>>>>>>>>> The pricing plan[1] of travis shows it can support concurrent build jobs.
>>>>>>>>>>>>>>>>> But I don't know which plan we are using, might be the free plan for open source.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I cc-ed Chesnay who may have some experience on Travis.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>> Jark
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [1]: https://travis-ci.com/plans
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Tue, 25 Jun 2019 at 08:11, Bowen Li <[hidden email]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Steven,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I think you may not read what I wrote. The discussion is about "unstable build **capacity**", in another word "unstable / lack of build resources", not "unstable build".
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Mon, Jun 24, 2019 at 4:40 PM Steven Wu <[hidden email]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> long and sometimes unstable build is definitely a pain point.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I suspect the build failure here in flink-connector-kafka is not related to my change. but there is no easy re-run the build on travis UI. search showed a trick of close-and-open the PR will trigger rebuild. but that could add noises to the PR activities.
>>>>>>>>>>>>>>>>>>> https://travis-ci.org/apache/flink/jobs/545555519
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> travis-ci for my personal repo often failed with exceeding time limit after 4+ hours.
>>>>>>>>>>>>>>>>>>> The job exceeded the maximum time limit for jobs, and has been terminated.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Mon, Jun 24, 2019 at 4:15 PM Bowen Li <[hidden email]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> https://travis-ci.org/apache/flink/builds/549681530 This build request has been sitting at **HEAD of the queue** since I first saw it at PST 10:30am (not sure how long it's been there before 10:30am). It's PST 4:12pm now and it hasn't started yet.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Mon, Jun 24, 2019 at 2:48 PM Bowen Li <[hidden email]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi devs,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I've been experiencing the pain resulting from lack of stable build capacity on Travis for Flink PRs [1]. Specifically, I noticed often that no build in the queue is making any progress for hours, and suddenly 5 or 6 builds kick off all together after the long pause. I'm at PST (UTC-08) time zone, and I've seen pause can be as long as 6 hours from PST 9am to 3pm (let alone the time needed to drain the queue afterwards).
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I think this has greatly impacted our productivity. I've experienced that PRs submitted in the early morning of PST time zone won't finish their build until late night of the same day.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> So my questions are:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> - Has anyone else experienced the same problem or have similar observation on TravisCI? (I suspect it has things to do with time zone)
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> - What pricing plan of TravisCI is Flink currently using? Is it the free plan for open source projects? What are the guaranteed build capacity of the current plan?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> - If the current pricing plan (either free or paid) can't provide stable build capacity, can we upgrade to a higher priced plan with larger and more stable build capacity?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> BTW, another factor that contribute to the productivity problem is that our build is slow - we run full build for every PR and a successful full build takes ~5h. We definitely have more options to solve it, for instance, modularize the build graphs and reuse artifacts from the previous build. But I think that can be a big effort which is much harder to accomplish in a short period of time and may deserve its own separate discussion.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> [1] https://travis-ci.org/apache/flink/pull_requests
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Best Regards
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Jeff Zhang
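For reference, the capacity and cost figures quoted in this thread can be sanity-checked with a few lines of arithmetic. This is only a sketch: the 30-day month is an assumption (the INFRA comment does not say how it converted hours to servers), and the variable names are purely illustrative.

```python
# Figures are taken from the quoted messages; the 30-day month is an assumption.
HOURS_PER_MONTH = 24 * 30

flink_build_hours = 5800   # "over 5800 hours of build time last month"
servers_equivalent = flink_build_hours / HOURS_PER_MONTH

queued_prs = 25            # "25 PR's in our queue"
hours_per_build = 2        # "our builds run for 2h from start to finish"
queue_hours = queued_prs * hours_per_build

monthly_plan_usd = 249     # travis-ci.com plan with 5 concurrent jobs
yearly_plan_usd = monthly_plan_usd * 12

print(f"{servers_equivalent:.2f} servers running 24/7")
print(f"{queue_hours} worker hours to build the queued PRs")
print(f"${yearly_plan_usd} per year for the 5-concurrency plan")
```

Under that assumption the numbers line up with what was said above: roughly eight servers, 50 hours of blocked workers, and $2988 a year.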
+1 and great thanks Chesnay for pushing this.

Best,
Kurt

On Thu, Jul 4, 2019 at 5:44 PM Aljoscha Krettek <[hidden email]> wrote:

> +1
>
> Aljoscha
>
> > On 4. Jul 2019, at 11:09, Stephan Ewen <[hidden email]> wrote:
> >
> > +1 to move to a private Travis account.
> >
> > I can confirm that Ververica will sponsor a Travis CI plan that is equivalent or a bit higher than the previous ASF quota (10 concurrent build queues)
> >
> > Best,
> > Stephan
> >
> > On Thu, Jul 4, 2019 at 10:46 AM Chesnay Schepler <[hidden email]> wrote:
> >
> >> I've raised a JIRA <https://issues.apache.org/jira/browse/INFRA-18703> with INFRA to inquire whether it would be possible to switch to a different Travis account, and if so what steps would need to be taken.
> >> We need a proper confirmation from INFRA since we are not in full control of the flink repository (for example, we cannot access the settings page).
> >>
> >> If this is indeed possible, Ververica is willing to sponsor a Travis account for the Flink project.
> >> This would provide us with more than enough resources.
> >>
> >> Since this makes the project more reliant on resources provided by external companies I would like to vote on this.
> >>
> >> Please vote on this proposal, as follows:
> >> [ ] +1, Approve the migration to a Ververica-sponsored Travis account, provided that INFRA approves
> >> [ ] -1, Do not approve the migration to a Ververica-sponsored Travis account
> >>
> >> The vote will be open for at least 24h, and until we have confirmation from INFRA. The voting period may be shorter than the usual 3 days since our current setup is effectively not working.
> >>
> >> On 04/07/2019 06:51, Bowen Li wrote:
> >>> Re: > Are they using their own Travis CI pool, or did they switch to an entirely different CI service?
> >>>
> >>> I reached out to Wes and Krisztián from Apache Arrow PMC.
> >>> They are currently moving away from ASF's Travis to their own in-house metal machines at [1] with custom CI application at [2]. They've seen significant improvement w.r.t both much higher performance and basically no resource waiting time, "night-and-day" difference quoting Wes.
> >>>
> >>> Re: > If we can just switch to our own Travis pool, just for our project, then this might be something we can do fairly quickly?
> >>>
> >>> I believe so, according to [3] and [4]
> >>>
> >>> [1] https://ci.ursalabs.org/
> >>> [2] https://github.com/ursa-labs/ursabot
> >>> [3] https://docs.travis-ci.com/user/migrate/open-source-repository-migration
> >>> [4] https://docs.travis-ci.com/user/migrate/open-source-on-travis-ci-com
> >>>
> >>> On Wed, Jul 3, 2019 at 12:01 AM Chesnay Schepler <[hidden email]> wrote:
> >>>
> >>>     Are they using their own Travis CI pool, or did they switch to an entirely different CI service?
> >>>
> >>>     If we can just switch to our own Travis pool, just for our project, then this might be something we can do fairly quickly?
> >>>
> >>>     On 03/07/2019 05:55, Bowen Li wrote:
> >>>> I responded in the INFRA ticket [1] that I believe they are using a wrong metric against Flink and the total build time is a completely different thing than guaranteed build capacity.
> >>>>
> >>>> My response:
> >>>>
> >>>> "As mentioned above, since I started to pay attention to Flink's build queue a few tens of days ago, I'm in Seattle and I saw no build was kicking off in PST daytime in weekdays for Flink. Our teammates in China and Europe have also reported similar observations. So we need to evaluate how the large total build time came from - if 1) your number and 2) our observations from three locations that cover pretty much a full day, are all true, I **guess** one reason can be that - highly likely the extra build time came from weekends when other Apache projects may be idle and Flink just drains hard its congested queue.
> >>>>
> >>>> Please be aware of that we're not complaining about the lack of resources in general, I'm complaining about the lack of **stable, dedicated** resources. An example for the latter one is, currently even if no build is in Flink's queue and I submit a request to be the queue head in PST morning, my build won't even start in 6-8+h. That is an absurd amount of waiting time.
> >>>>
> >>>> That's saying, if ASF INFRA decides to adopt a quota system and grants Flink five DEDICATED servers that run all the time only for Flink, that'll be PERFECT and can totally solve our problem now.
> >>>>
> >>>> I feel what's missing in the ASF INFRA's Travis resource pool is some level of build capacity SLAs and certainty"
> >>>>
> >>>> Again, I believe there are differences in nature of these two problems, long build time v.s. lack of dedicated build resource. That's saying, shortening build time may relieve the situation, and may not. I'm slightly negative on disabling IT cases for PRs, since the downside is that we are at risk of potential bugs in PRs that UTs don't catch, which may cost a lot more to fix and may slow down or even block others, but am open to others' opinions on it.
> >>>>
> >>>> AFAICT from INFRA ticket[1], donating to ASF INFRA won't be feasible to solve our problem since INFRA's pool is fully shared and they have no control and finer insights over resource allocation to a specific Apache project. As mentioned in [1], Apache Arrow is moving away from ASF INFRA Travis pool (they are actually surprised Flink hasn't planned to do so). I know that Spark is on its own build infra. If we all agree on funding our own build infra, I'd be glad to help investigate any potential options after releasing 1.9 since I'm super busy with 1.9 now.
> >>>>
> >>>> [1] https://issues.apache.org/jira/browse/INFRA-18533
> >>>>
> >>>> On Tue, Jul 2, 2019 at 4:46 AM Chesnay Schepler <[hidden email]> wrote:
> >>>>
> >>>>> As a short-term stopgap, since we can assume this issue to become much worse in the following days/weeks, we could disable IT cases in PRs and only run them on master.
> >>>>>
> >>>>> On 02/07/2019 12:03, Chesnay Schepler wrote:
> >>>>>> People really have to stop thinking that just because something works for us it is also a good solution.
> >>>>>> Also, please remember that our builds run for 2h from start to finish, and not the 14 _minutes_ it takes for zeppelin.
> >>>>>> We are dealing with an entirely different scale here, both in terms of build times and number of builds.
> >>>>>>
> >>>>>> In this very thread people have been complaining about long queue times for their builds. Surprise, other Apache projects have been suffering the very same thing due to us not controlling our build times. While switching services (be it Jenkins, CircleCI or whatever) will possibly work for us (and these options are actually attractive, like CircleCI's proper support for build artifacts), it will also result in us likely negatively affecting other projects in significant ways.
> >>>>>>
> >>>>>> Sure, the Jenkins setup has a good user experience for us, at the cost of blocking Jenkins workers for a _lot_ of time. Right now we have 25 PRs in our queue; that's possibly 50h we'd consume of Jenkins resources, and the European contributors haven't even really started yet.
> >>>>>>
> >>>>>> FYI, the latest INFRA response from INFRA-18533:
> >>>>>>
> >>>>>> "Our rough metrics shows that Flink used over 5800 hours of build time last month. That is equal to EIGHT servers running 24/7 for the ENTIRE MONTH. EIGHT. nonstop.
> >>>>>> When we discovered this last night, we discussed it some and are going to tune down Flink to allow only five executors maximum. We cannot allow Flink to consume so much of a Foundation shared resource."
> >>>>>>
> >>>>>> So yes, we either
> >>>>>> a) have to heavily reduce our CI usage or
> >>>>>> b) fund our own, either maintaining it ourselves or donating to Apache.
> >>>>>>
> >>>>>> On 02/07/2019 05:11, Bowen Li wrote:
> >>>>>>> By looking at the git history of the Jenkins script, its core part was finished in March 2017 (and only two minor updates in 2017/2018), so it's been running for over two years now and feels like Zeppelin community has been quite happy with it. @Jeff Zhang <[hidden email]> can you share your insights and user experience with the Jenkins+Travis approach?
> >>>>>>>
> >>>>>>> Things like:
> >>>>>>>
> >>>>>>> - has the approach completely solved the resource capacity problem for Zeppelin community? is Zeppelin community happy with the result?
> >>>>>>> - is the whole configuration chain stable (e.g. uptime) enough?
> >>>>>>> - how often do you need to maintain the Jenkins infra? how many people are usually involved in maintenance and bug-fixes?
> >>>>>>>
> >>>>>>> The downside of this approach seems mostly to be on the maintenance to me - maintain the script and Jenkins infra.
> >>>>>>>
> >>>>>>> ** Having Our Own Travis-CI.com Account **
> >>>>>>>
> >>>>>>> Another alternative I've been thinking of is to have our own travis-ci.com account with paid dedicated resources. Note travis-ci.org is the free version and travis-ci.com is the commercial version. We currently use a shared resource pool managed by ASF INFRA team on travis-ci.org, but we have no control over it - we can't see how it's configured, how much resources are available, how resources are allocated among Apache projects, etc.
> >>>>>>> The nice thing about having an account on travis-ci.com > >>> <http://travis-ci.com> > >>>>>>> <http://travis-ci.com> are: > >>>>>>> > >>>>>>> - relatively low cost with much better resource guarantee > >>> than what > >>>>>>> we currently have [1]: $249/month with 5 dedicated concurrency, > >>>>>>> $489/month with 10 concurrency > >>>>>>> - low maintenance work compared to using Jenkins > >>>>>>> - (potentially) no migration cost according to Travis's doc [2] > >>>>>>> (pending verification) > >>>>>>> - full control over the build capacity/configuration compared to > >>>>>>> using ASF INFRA's pool > >>>>>>> > >>>>>>> I'd be surprised if we as such a vibrant community cannot > >>> find and > >>>>>>> fund $249*12=$2988 a year in exchange for a much better > >> developer > >>>>>>> experience and much higher productivity. > >>>>>>> > >>>>>>> [1] https://travis-ci.com/plans > >>>>>>> [2] > >>>>>>> > >>>>> > >>> > >> > https://docs.travis-ci.com/user/migrate/open-source-repository-migration > >>>>>>> On Sat, Jun 29, 2019 at 8:39 AM Chesnay Schepler > >>> <[hidden email] <mailto:[hidden email]> > >>>>>>> <mailto:[hidden email] <mailto:[hidden email]>>> wrote: > >>>>>>> > >>>>>>> So yes, the Jenkins job keeps pulling the state from > >>> Travis until it > >>>>>>> finishes. > >>>>>>> > >>>>>>> Note sure I'm comfortable with the idea of using Jenkins > >>> workers > >>>>>>> just to > >>>>>>> idle for a several hours. > >>>>>>> > >>>>>>> On 29/06/2019 14:56, Jeff Zhang wrote: > >>>>>>>> Here's what zeppelin community did, we make a python > >>> script to > >>>>>>> check the > >>>>>>>> build status of pull request. > >>>>>>>> Here's script: > >>>>>>>> > >>> https://github.com/apache/zeppelin/blob/master/travis_check.py > >>>>>>>> > >>>>>>>> And this is the script we used in Jenkins build job. 
> >>>>>>>> > >>>>>>>> if [ -f "travis_check.py" ]; then > >>>>>>>> git log -n 1 > >>>>>>>> STATUS=$(curl -s $BUILD_URL | grep -e "GitHub pull > >>>>>>> request.*from.*" | sed > >>>>>>>> 's/.*GitHub pull request <a > >>>>>>>> href=\"\(https[^"]*\).*from[^"]*.\(https[^"]*\).*/\1 > >>> \2/g') > >>>>>>>> AUTHOR=$(echo $STATUS | sed 's/.*[/]\(.*\)$/\1/g') > >>>>>>>> PR=$(echo $STATUS | awk '{print $1}' | sed > >>>>>>> 's/.*[/]\(.*\)$/\1/g') > >>>>>>>> #COMMIT=$(git log -n 1 | grep "^Merge:" | awk > >>> '{print $3}') > >>>>>>>> #if [ -z $COMMIT ]; then > >>>>>>>> # COMMIT=$(curl -s > >>>>>>> https://api.github.com/repos/apache/zeppelin/pulls/$PR > >>>>>>>> | grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | > >>> tr '\n' ' ' > >>>>>>> | sed > >>>>>>>> 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' | > >>> grep -v > >>>>>>> "apache:" | > >>>>>>>> sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g') > >>>>>>>> #fi > >>>>>>>> > >>>>>>>> # get commit hash from PR > >>>>>>>> COMMIT=$(curl -s > >>>>>>> https://api.github.com/repos/apache/zeppelin/pulls/$PR | > >>>>>>>> grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | tr > >>> '\n' ' ' > >>>>>>> | sed > >>>>>>>> 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' | > >>> grep -v > >>>>>>> "apache:" | > >>>>>>>> sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g') > >>>>>>>> sleep 30 # sleep few moment to wait travis starts > >>> the build > >>>>>>>> RET_CODE=0 > >>>>>>>> python ./travis_check.py ${AUTHOR} ${COMMIT} || > >>> RET_CODE=$? > >>>>>>>> if [ $RET_CODE -eq 2 ]; then # try with repository > >>> name when > >>>>>>> travis-ci is > >>>>>>>> not available in the account > >>>>>>>> RET_CODE=0 > >>>>>>>> AUTHOR=$(curl -s > >>>>>>> https://api.github.com/repos/apache/zeppelin/pulls/$PR > >>>>>>>> | grep '"full_name":' | grep -v "apache/zeppelin" | sed > >>>>>>>> 's/.*[:][^"]*["]\([^/]*\).*/\1/g') > >>>>>>>> python ./travis_check.py ${AUTHOR} ${COMMIT} || > >>> RET_CODE=$? 
> >>>>>>>> fi > >>>>>>>> > >>>>>>>> if [ $RET_CODE -eq 2 ]; then # fail with can't find > >>> build > >>>>>>> information in > >>>>>>>> the travis > >>>>>>>> set +x > >>>>>>>> echo > >>> "-----------------------------------------------------" > >>>>>>>> echo "Looks like travis-ci is not configured for > >>> your fork." > >>>>>>>> echo "Please setup by swich on 'zeppelin' > >>> repository at > >>>>>>>> https://travis-ci.org/profile and travis-ci." > >>>>>>>> echo "And then make sure 'Build branch updates' > >>> option is > >>>>>>> enabled in > >>>>>>>> the settings > >>> https://travis-ci.org/${AUTHOR}/zeppelin/settings > >>> <https://travis-ci.org/$%7BAUTHOR%7D/zeppelin/settings> > >>>>>>> <https://travis-ci.org/$%7BAUTHOR%7D/zeppelin/settings>." > >>>>>>>> echo "" > >>>>>>>> echo "To trigger CI after setup, you will need > >>> ammend your > >>>>>>> last commit > >>>>>>>> with" > >>>>>>>> echo "git commit --amend" > >>>>>>>> echo "git push your-remote HEAD --force" > >>>>>>>> echo "" > >>>>>>>> echo "See > >>>>>>>> > >>>>>>> > >>>>> > >>> > >> > http://zeppelin.apache.org/contribution/contributions.html#continuous-integration > >>>>>>>> ." > >>>>>>>> fi > >>>>>>>> > >>>>>>>> exit $RET_CODE > >>>>>>>> else > >>>>>>>> set +x > >>>>>>>> echo "travis_check.py does not exists" > >>>>>>>> exit 1 > >>>>>>>> fi > >>>>>>>> > >>>>>>>> Chesnay Schepler <[hidden email] > >>> <mailto:[hidden email]> > >>>>>>> <mailto:[hidden email] <mailto:[hidden email]>>> > >>> 于2019年6月29日周六 下午3:17写道: > >>>>>>>> > >>>>>>>>> Does this imply that a Jenkins job is active as long > >>> as the > >>>>>>> Travis build > >>>>>>>>> runs? 
> >>>>>>>>> > >>>>>>>>> On 26/06/2019 21:28, Bowen Li wrote: > >>>>>>>>>> Hi, > >>>>>>>>>> > >>>>>>>>>> @Dawid, I think the "long test running" as I > >>> mentioned in the > >>>>>>> first > >>>>>>>>> email, > >>>>>>>>>> also as you guys said, belongs to "a big effort > >>> which is much > >>>>>>> harder to > >>>>>>>>>> accomplish in a short period of time and may deserve > >>> its own > >>>>>>> separate > >>>>>>>>>> discussion". Thus I didn't include it in what we can > >>> do in a > >>>>>>> foreseeable > >>>>>>>>>> short term. > >>>>>>>>>> > >>>>>>>>>> Besides, I don't think that's the ultimate reason > >>> for lack of > >>>>>>> build > >>>>>>>>>> resources. Even if the build is shortened to > >>> something like > >>>>>>> 2h, the > >>>>>>>>>> problems of no build machine works about 6 or more > >>> hours in > >>>>>>> PST daytime > >>>>>>>>>> that I described will still happen, because no > >>> machine from > >>>>>>> ASF INFRA's > >>>>>>>>>> pool is allocated to Flink. As I have paid close > >>> attention to > >>>>>>> the build > >>>>>>>>>> queue in the past few weekdays, it's a pretty clear > >>> pattern now. > >>>>>>>>>> > >>>>>>>>>> **The ultimate root cause** for that is - we don't > >>> have any > >>>>>>> **dedicated** > >>>>>>>>>> build resources that we can stably rely on. I'm > >>> actually ok to > >>>>>>> wait for a > >>>>>>>>>> long time if there are build requests running, it > >>> means at > >>>>>>> least we are > >>>>>>>>>> making progress. But I'm not ok with no build > >>> resource. A > >>>>>>> better place I > >>>>>>>>>> think we should aim at in short term is to always > >>> have at > >>>>>>> least a central > >>>>>>>>>> pool (can be 3 or 5) of machines dedicated to build > >>> Flink at > >>>>>>> any time, or > >>>>>>>>>> maybe use users resources. 
> >>>>>>>>>> > >>>>>>>>>> @Chesnay @Robert I synced with Jeff offline that > >>> Zeppelin > >>>>>>> community is > >>>>>>>>>> using a Jenkins job to automatically build on users' > >>> travis > >>>>>>> account and > >>>>>>>>>> link the result back to github PR. I guess the > >>> Jenkins job > >>>>>>> would fetch > >>>>>>>>>> latest upstream master and build the PR against it. > >>> Jeff has > >>>>>>> filed > >>>>>>>>> tickets > >>>>>>>>>> to learn and get access to the Jenkins infra. It'll > >>> better to > >>>>>>> fully > >>>>>>>>>> understand it first before judging this approach. > >>>>>>>>>> > >>>>>>>>>> I also heard good things about CircleCI, and ASF > >>> INFRA seems > >>>>>>> to have a > >>>>>>>>> pool > >>>>>>>>>> of build capacity there too. Can be an alternative > >>> to consider. > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> On Wed, Jun 26, 2019 at 12:44 AM Dawid Wysakowicz < > >>>>>>>>> [hidden email] > >>> <mailto:[hidden email]> <mailto:[hidden email] > >>> <mailto:[hidden email]>>> > >>>>>>>>>> wrote: > >>>>>>>>>> > >>>>>>>>>>> Sorry to jump in late, but I think Bowen missed the > >>> most > >>>>>>> important point > >>>>>>>>>>> from Chesnay's previous message in the summary. The > >>> ultimate > >>>>>>> reason for > >>>>>>>>>>> all the problems is that the tests take close to 2 > >>> hours to > >>>>>>> run already. > >>>>>>>>>>> I fully support this claim: "Unless people start > >>> caring about > >>>>>>> test times > >>>>>>>>>>> before adding them, this issue cannot be solved" > >>>>>>>>>>> > >>>>>>>>>>> This is also another reason why using user's Travis > >>> account > >>>>>>> won't help. > >>>>>>>>>>> Every few weeks we reach the user's time limit for > >>> a single > >>>>>>> profile. 
> >>>>>>>>>>> This makes the user's builds simply fail, until we > >>> either > >>>>>>> properly > >>>>>>>>>>> decrease the time the tests take (which I am not > >>> sure we ever > >>>>>>> did) or > >>>>>>>>>>> postpone the problem by splitting into more > >>> profiles. (Note > >>>>>>> that the ASF > >>>>>>>>>>> Travis account has higher time limits) > >>>>>>>>>>> > >>>>>>>>>>> Best, > >>>>>>>>>>> > >>>>>>>>>>> Dawid > >>>>>>>>>>> > >>>>>>>>>>> On 26/06/2019 09:36, Robert Metzger wrote: > >>>>>>>>>>>> Do we know if using "the best" available hardware > >>> would > >>>>>>> improve the > >>>>>>>>> build > >>>>>>>>>>>> times? > >>>>>>>>>>>> Imagine we would run the build on machines with > >>> plenty of > >>>>>>> main memory > >>>>>>>>> to > >>>>>>>>>>>> mount everything to ramdisk + the latest CPU > >>> architecture? > >>>>>>>>>>>> > >>>>>>>>>>>> Throwing hardware at the problem could help reduce > >>> the time > >>>>>>> of an > >>>>>>>>>>>> individual build, and using our own infrastructure > >>> would > >>>>>>> remove our > >>>>>>>>>>>> dependency on Apache's Travis account (with the > >>> obvious > >>>>>>> downside of > >>>>>>>>>>> having > >>>>>>>>>>>> to maintain the infrastructure) > >>>>>>>>>>>> We could use an open source travis alternative, to > >>> have a > >>>>>>> similar > >>>>>>>>>>>> experience and make the migration easy. > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> On Wed, Jun 26, 2019 at 9:34 AM Chesnay Schepler > >>>>>>> <[hidden email] <mailto:[hidden email]> > >>> <mailto:[hidden email] <mailto:[hidden email]>>> > >>>>>>>>>>> wrote: > >>>>>>>>>>>>>> From what I gathered, there's no special > >>> sauce that the > >>>>>>> Zeppelin > >>>>>>>>>>>>> project uses which actually integrates a users > >> Travis > >>>>>>> account into the > >>>>>>>>>>> PR. > >>>>>>>>>>>>> They just disabled Travis for PRs. And that's > >>> kind of it. 
> >>>>>>>>>>>>> > >>>>>>>>>>>>> Naturally we can do this (duh) and safe the ASF a > >>> fair > >>>>>>> amount of > >>>>>>>>>>>>> resources, but there are downsides: > >>>>>>>>>>>>> > >>>>>>>>>>>>> The discoverability of the Travis check takes a > >>> nose-dive. > >>>>>>> Either we > >>>>>>>>>>>>> require every contributor to always, an every > >>> commit, also > >>>>>>> post a > >>>>>>>>> Travis > >>>>>>>>>>>>> build, or we have the reviewer sift through the > >>>>>>> contributors account > >>>>>>>>> to > >>>>>>>>>>>>> find it. > >>>>>>>>>>>>> > >>>>>>>>>>>>> This is rather cumbersome. Additionally, it's > >>> also not > >>>>>>> equivalent to > >>>>>>>>>>>>> having a PR build. > >>>>>>>>>>>>> > >>>>>>>>>>>>> A normal branch build takes a branch as is and > >>> tests it. A > >>>>>>> PR build > >>>>>>>>>>>>> merges the branch into master, and then runs it. > >>> (Fun fact: > >>>>>>> This is > >>>>>>>>> why > >>>>>>>>>>>>> a PR without merge conflicts is not being run on > >>> Travis.) > >>>>>>>>>>>>> > >>>>>>>>>>>>> And ultimately, everyone can already make use of > >> this > >>>>>>> approach anyway. > >>>>>>>>>>>>> > >>>>>>>>>>>>> On 25/06/2019 08:02, Jark Wu wrote: > >>>>>>>>>>>>>> Hi Jeff, > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Thanks for sharing the Zeppelin approach. I > >>> think it's a > >>>>>>> good idea to > >>>>>>>>>>>>>> leverage user's travis account. > >>>>>>>>>>>>>> In this way, we can have almost unlimited > >>> concurrent build > >>>>>>> jobs and > >>>>>>>>>>>>>> developers can restart build by themselves > >>> (currently only > >>>>>>> committers > >>>>>>>>>>>>>> can restart PR's build). > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> But I'm still not very clear how to integrate > >> user's > >>>>>>> travis build > >>>>>>>>> into > >>>>>>>>>>>>>> the Flink pull request's build automatically. > >>> Can you > >>>>>>> explain more in > >>>>>>>>>>>>>> detail? 
> >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Another question: does travis only build > >>> branches for user > >>>>>>> account? > >>>>>>>>>>>>>> My concern is that builds for PRs will rebase > >> user's > >>>>>>> commits against > >>>>>>>>>>>>>> current master branch. > >>>>>>>>>>>>>> This will help us to find problems before > >>> merge. Builds > >>>>>>> for branches > >>>>>>>>>>>>>> will lose the impact of new commits in master. > >>>>>>>>>>>>>> How does Zeppelin solve this problem? > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Thanks again for sharing the idea. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Regards, > >>>>>>>>>>>>>> Jark > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On Tue, 25 Jun 2019 at 11:01, Jeff Zhang > >>> <[hidden email] <mailto:[hidden email]> > >>>>>>> <mailto:[hidden email] <mailto:[hidden email]>> > >>>>>>>>>>>>>> <mailto:[hidden email] > >>> <mailto:[hidden email]> <mailto:[hidden email] > >>> <mailto:[hidden email]>>>> wrote: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Hi Folks, > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Zeppelin meet this kind of issue before, we solve > >>>>>>> it by > >>>>>>>>> delegating > >>>>>>>>>>>>>> each > >>>>>>>>>>>>>> one's PR build to his travis account > >>> (Everyone can > >>>>>>> have 5 free > >>>>>>>>>>>>>> slot for > >>>>>>>>>>>>>> travis build). > >>>>>>>>>>>>>> Apache account travis build is only triggered when > >>>>>>> PR is merged. 
> >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Kurt Young <[hidden email] > >>> <mailto:[hidden email]> > >>>>>>> <mailto:[hidden email] <mailto:[hidden email]>> > >>> <mailto:[hidden email] <mailto:[hidden email]> > >>>>>>> <mailto:[hidden email] <mailto:[hidden email]>>>> > >>>>>>>>>>>>>> 于2019年6月25日周二 上午10:16写道: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> (Forgot to cc George) > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Best, > >>>>>>>>>>>>>>> Kurt > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> On Tue, Jun 25, 2019 at 10:16 AM Kurt Young > >>>>>>> <[hidden email] <mailto:[hidden email]> > >>> <mailto:[hidden email] <mailto:[hidden email]>> > >>>>>>>>>>>>>> <mailto:[hidden email] > >>> <mailto:[hidden email]> <mailto:[hidden email] > >>> <mailto:[hidden email]>>>> > >>>>>>> wrote: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Hi Bowen, > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Thanks for bringing this up. We > >>> actually have > >>>>>>> discussed > >>>>>>>>> about > >>>>>>>>>>>>>> this, and I > >>>>>>>>>>>>>>>> think Till and George have > >>>>>>>>>>>>>>>> already spend sometime investigating > >>> it. I have > >>>>>>> cced both of > >>>>>>>>>>>>>> them, and > >>>>>>>>>>>>>>>> maybe they can share > >>>>>>>>>>>>>>>> their findings. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Best, > >>>>>>>>>>>>>>>> Kurt > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> On Tue, Jun 25, 2019 at 10:08 AM Jark Wu > >>>>>>> <[hidden email] <mailto:[hidden email]> > >>> <mailto:[hidden email] <mailto:[hidden email]>> > >>>>>>>>>>>>>> <mailto:[hidden email] > >>> <mailto:[hidden email]> <mailto:[hidden email] > >>> <mailto:[hidden email]>>>> > >>>>>>> wrote: > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Hi Bowen, > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Thanks for bringing this. We also > >>> suffered from > >>>>>>> the long > >>>>>>>>>>>>>> build time. 
> >>>>>>>>>>>>>>>>> I agree that we should focus on > >>> solving build > >>>>>>> capacity > >>>>>>>>>>>>>> problem in the > >>>>>>>>>>>>>>>>> thread. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> My observation is there is only one > >>> build is > >>>>>>> running, all > >>>>>>>>> the > >>>>>>>>>>>>>> others > >>>>>>>>>>>>>>>>> (other > >>>>>>>>>>>>>>>>> PRs, master) are pending. > >>>>>>>>>>>>>>>>> The pricing plan[1] of travis shows > >>> it can > >>>>>>> support > >>>>>>>>> concurrent > >>>>>>>>>>>>>> build > >>>>>>>>>>>>>>> jobs. > >>>>>>>>>>>>>>>>> But I don't know which plan we are > >>> using, might > >>>>>>> be the free > >>>>>>>>>>>>>> plan for > >>>>>>>>>>>>>>> open > >>>>>>>>>>>>>>>>> source. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> I cc-ed Chesnay who may have some > >>> experience on > >>>>>>> Travis. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Regards, > >>>>>>>>>>>>>>>>> Jark > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> [1]: https://travis-ci.com/plans > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> On Tue, 25 Jun 2019 at 08:11, Bowen Li < > >>>>>>>>> [hidden email] <mailto:[hidden email]> > >>> <mailto:[hidden email] <mailto:[hidden email]>> > >>>>>>>>>>>>>> <mailto:[hidden email] > >>> <mailto:[hidden email]> > >>>>>>> <mailto:[hidden email] > >>> <mailto:[hidden email]>>>> wrote: > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Hi Steven, > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> I think you may not read what I > >>> wrote. The > >>>>>>> discussion is > >>>>>>>>>>> about > >>>>>>>>>>>>>>> "unstable > >>>>>>>>>>>>>>>>>> build **capacity**", in another word > >>>>>>> "unstable / lack of > >>>>>>>>>>> build > >>>>>>>>>>>>>>>>> resources", > >>>>>>>>>>>>>>>>>> not "unstable build". 
> >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> On Mon, Jun 24, 2019 at 4:40 PM > >>> Steven Wu > >>>>>>>>>>>>>> <[hidden email] > >>> <mailto:[hidden email]> <mailto:[hidden email] > >>> <mailto:[hidden email]>> > >>>>>>> <mailto:[hidden email] > >>> <mailto:[hidden email]> <mailto:[hidden email] > >>> <mailto:[hidden email]>>>> > >>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> long and sometimes unstable build is > >>>>>>> definitely a pain > >>>>>>>>>>>>> point. > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> I suspect the build failure here in > >>>>>>>>> flink-connector-kafka > >>>>>>>>>>>>>> is not > >>>>>>>>>>>>>>>>> related > >>>>>>>>>>>>>>>>>> to > >>>>>>>>>>>>>>>>>>> my change. but there is no easy > >>> re-run the > >>>>>>> build on > >>>>>>>>>>>>>> travis UI. > >>>>>>>>>>>>>>>>>>> search showed a trick of > >>> close-and-open the > >>>>>>> PR will > >>>>>>>>>>>>>> trigger rebuild. > >>>>>>>>>>>>>>>>> but > >>>>>>>>>>>>>>>>>>> that could add noises to the PR > >>> activities. > >>>>>>>>>>>>>>>>>>> > >>>>>>> https://travis-ci.org/apache/flink/jobs/545555519 > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> travis-ci for my personal repo > >>> often failed > >>>>>>> with > >>>>>>>>>>>>>> exceeding time > >>>>>>>>>>>>>>> limit > >>>>>>>>>>>>>>>>>> after > >>>>>>>>>>>>>>>>>>> 4+ hours. > >>>>>>>>>>>>>>>>>>> The job exceeded the maximum time > >>> limit for > >>>>>>> jobs, and > >>>>>>>>> has > >>>>>>>>>>>>>> been > >>>>>>>>>>>>>>>>>> terminated. 
> >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> On Mon, Jun 24, 2019 at 4:15 PM > >>> Bowen Li > >>>>>>>>>>>>>> <[hidden email] > >>> <mailto:[hidden email]> <mailto:[hidden email] > >>> <mailto:[hidden email]>> > >>>>>>> <mailto:[hidden email] <mailto:[hidden email]> > >>> <mailto:[hidden email] <mailto:[hidden email]>>>> > >>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> > >>>>>>> https://travis-ci.org/apache/flink/builds/549681530 > >>>>>>>>>>>>>> This build > >>>>>>>>>>>>>>>>>> request > >>>>>>>>>>>>>>>>>>>> has > >>>>>>>>>>>>>>>>>>>> been sitting at **HEAD of the > >>> queue** > >>>>>>> since I first > >>>>>>>>> saw > >>>>>>>>>>>>>> it at PST > >>>>>>>>>>>>>>>>>> 10:30am > >>>>>>>>>>>>>>>>>>>> (not sure how long it's been > >>> there before > >>>>>>> 10:30am). > >>>>>>>>>>>>>> It's PST > >>>>>>>>>>>>>>> 4:12pm > >>>>>>>>>>>>>>>>> now > >>>>>>>>>>>>>>>>>>> and > >>>>>>>>>>>>>>>>>>>> it hasn't started yet. > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> On Mon, Jun 24, 2019 at 2:48 PM > >>> Bowen Li > >>>>>>>>>>>>>> <[hidden email] > >>> <mailto:[hidden email]> <mailto:[hidden email] > >>> <mailto:[hidden email]>> > >>>>>>> <mailto:[hidden email] <mailto:[hidden email]> > >>> <mailto:[hidden email] <mailto:[hidden email]>>>> > >>>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> Hi devs, > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> I've been experiencing the pain > >>>>>>> resulting from lack > >>>>>>>>>>>>>> of stable > >>>>>>>>>>>>>>>>> build > >>>>>>>>>>>>>>>>>>>>> capacity on Travis for Flink > >>> PRs [1]. 
> >>>>>>>>> Specifically, I > >>>>>>>>>>>>>> noticed > >>>>>>>>>>>>>>>>> often > >>>>>>>>>>>>>>>>>>> that > >>>>>>>>>>>>>>>>>>>> no > >>>>>>>>>>>>>>>>>>>>> build in the queue is making any > >>>>>>> progress for > >>>>>>>>> hours, > >>>>>>>>>>> and > >>>>>>>>>>>>>>> suddenly > >>>>>>>>>>>>>>>>> 5 > >>>>>>>>>>>>>>>>>> or > >>>>>>>>>>>>>>>>>>> 6 > >>>>>>>>>>>>>>>>>>>>> builds kick off all together > >>> after the > >>>>>>> long pause. > >>>>>>>>>>>>>> I'm at PST > >>>>>>>>>>>>>>>>>> (UTC-08) > >>>>>>>>>>>>>>>>>>>> time > >>>>>>>>>>>>>>>>>>>>> zone, and I've seen pause can > >>> be as > >>>>>>> long as 6 hours > >>>>>>>>>>>>>> from PST 9am > >>>>>>>>>>>>>>>>> to > >>>>>>>>>>>>>>>>>> 3pm > >>>>>>>>>>>>>>>>>>>>> (let alone the time needed to > >>> drain the > >>>>>>> queue > >>>>>>>>>>>>>> afterwards). > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> I think this has greatly > >>> impacted our > >>>>>>> productivity. > >>>>>>>>>>> I've > >>>>>>>>>>>>>>>>> experienced > >>>>>>>>>>>>>>>>>>> that > >>>>>>>>>>>>>>>>>>>>> PRs submitted in the early > >>> morning of > >>>>>>> PST time zone > >>>>>>>>>>>>>> won't finish > >>>>>>>>>>>>>>>>>> their > >>>>>>>>>>>>>>>>>>>>> build until late night of the > >>> same day. > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> So my questions are: > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> - Has anyone else experienced > >>> the same > >>>>>>> problem or > >>>>>>>>>>>>>> have similar > >>>>>>>>>>>>>>>>>>>> observation > >>>>>>>>>>>>>>>>>>>>> on TravisCI? (I suspect it > >>> has things > >>>>>>> to do with > >>>>>>>>> time > >>>>>>>>>>>>>> zone) > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> - What pricing plan of > >>> TravisCI is > >>>>>>> Flink currently > >>>>>>>>>>>>>> using? Is it > >>>>>>>>>>>>>>>>> the > >>>>>>>>>>>>>>>>>>> free > >>>>>>>>>>>>>>>>>>>>> plan for open source > >>> projects? 
What > >>>>>>> are the > >>>>>>>>>>>>>> guaranteed build > >>>>>>>>>>>>>>>>> capacity > >>>>>>>>>>>>>>>>>>> of > >>>>>>>>>>>>>>>>>>>>> the current plan? > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> - If the current pricing plan > >>> (either > >>>>>>> free or paid) > >>>>>>>>>>>>> can't > >>>>>>>>>>>>>>> provide > >>>>>>>>>>>>>>>>>>> stable > >>>>>>>>>>>>>>>>>>>>> build capacity, can we > >>> upgrade to a > >>>>>>> higher priced > >>>>>>>>>>>>>> plan with > >>>>>>>>>>>>>>> larger > >>>>>>>>>>>>>>>>>> and > >>>>>>>>>>>>>>>>>>>> more > >>>>>>>>>>>>>>>>>>>>> stable build capacity? > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> BTW, another factor that > >>> contribute to > >>>>>>> the > >>>>>>>>>>>>>> productivity problem > >>>>>>>>>>>>>>> is > >>>>>>>>>>>>>>>>>> that > >>>>>>>>>>>>>>>>>>>>> our build is slow - we run > >>> full build > >>>>>>> for every PR > >>>>>>>>>>> and a > >>>>>>>>>>>>>>>>> successful > >>>>>>>>>>>>>>>>>>> full > >>>>>>>>>>>>>>>>>>>>> build takes ~5h. We > >>> definitely have > >>>>>>> more options to > >>>>>>>>>>>>>> solve it, > >>>>>>>>>>>>>>> for > >>>>>>>>>>>>>>>>>>>> instance, > >>>>>>>>>>>>>>>>>>>>> modularize the build graphs > >>> and reuse > >>>>>>> artifacts > >>>>>>>>> from > >>>>>>>>>>> the > >>>>>>>>>>>>>>> previous > >>>>>>>>>>>>>>>>>>> build. > >>>>>>>>>>>>>>>>>>>>> But I think that can be a big > >>> effort > >>>>>>> which is much > >>>>>>>>>>>>>> harder to > >>>>>>>>>>>>>>>>>> accomplish > >>>>>>>>>>>>>>>>>>>> in > >>>>>>>>>>>>>>>>>>>>> a short period of time and > >>> may deserve > >>>>>>> its own > >>>>>>>>>>> separate > >>>>>>>>>>>>>>>>> discussion. 
> >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> [1] > >>>>>>>>> https://travis-ci.org/apache/flink/pull_requests > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> -- > >>>>>>>>>>>>>> Best Regards > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Jeff Zhang > >>>>>>>>>>>>>> > >>>>>>>>> > >>>>>>> > >>>>>> > >>>>> > >>> > >> > >> > > |
In reply to this post by Chesnay Schepler-3
Small update with mostly bad news:
INFRA doesn't know whether it is possible, and referred me to Travis support.

They did point out that it could be problematic with regard to read/write permissions for the repository.

From my own findings /so far/ with a test repo/organization, it does not appear possible to configure the Travis account used for a specific repository.

So yeah, if we go down this route we may have to pimp the Flinkbot to trigger builds through the Travis REST API.

On 04/07/2019 10:46, Chesnay Schepler wrote:
> I've raised a JIRA <https://issues.apache.org/jira/browse/INFRA-18703> with INFRA to inquire whether it would be possible to switch to a different Travis account, and if so what steps would need to be taken.
> We need a proper confirmation from INFRA since we are not in full control of the flink repository (for example, we cannot access the settings page).
>
> If this is indeed possible, Ververica is willing to sponsor a Travis account for the Flink project.
> This would provide us with more than enough resources.
>
> Since this makes the project more reliant on resources provided by external companies I would like to vote on this.
>
> Please vote on this proposal, as follows:
> [ ] +1, Approve the migration to a Ververica-sponsored Travis account, provided that INFRA approves
> [ ] -1, Do not approve the migration to a Ververica-sponsored Travis account
>
> The vote will be open for at least 24h, and until we have confirmation from INFRA. The voting period may be shorter than the usual 3 days since our current setup is effectively not working.
>
> On 04/07/2019 06:51, Bowen Li wrote:
>> Re: > Are they using their own Travis CI pool, or did they switch to an entirely different CI service?
>>
>> I reached out to Wes and Krisztián from Apache Arrow PMC. They are currently moving away from ASF's Travis to their own in-house metal machines at [1] with custom CI application at [2].
>> They've seen significant improvement w.r.t. both much higher performance and basically no resource waiting time, a "night-and-day" difference, quoting Wes.
>>
>> Re: > If we can just switch to our own Travis pool, just for our project, then this might be something we can do fairly quickly?
>>
>> I believe so, according to [3] and [4].
>>
>> [1] https://ci.ursalabs.org/
>> [2] https://github.com/ursa-labs/ursabot
>> [3] https://docs.travis-ci.com/user/migrate/open-source-repository-migration
>> [4] https://docs.travis-ci.com/user/migrate/open-source-on-travis-ci-com
>>
>> On Wed, Jul 3, 2019 at 12:01 AM Chesnay Schepler <[hidden email]> wrote:
>>
>>     Are they using their own Travis CI pool, or did they switch to an entirely different CI service?
>>
>>     If we can just switch to our own Travis pool, just for our project, then this might be something we can do fairly quickly?
>>
>>     On 03/07/2019 05:55, Bowen Li wrote:
>> > I responded in the INFRA ticket [1] that I believe they are using a wrong metric against Flink and the total build time is a completely different thing than guaranteed build capacity.
>> >
>> > My response:
>> >
>> > "As mentioned above, since I started to pay attention to Flink's build queue a few tens of days ago, I'm in Seattle and I saw no build kicking off in PST daytime on weekdays for Flink. Our teammates in China and Europe have also reported similar observations. So we need to evaluate where the large total build time came from - if 1) your number and 2) our observations from three locations that cover pretty much a full day are all true, I **guess** one reason can be that, highly likely, the extra build time came from weekends when other Apache projects may be idle and Flink just drains hard its congested queue.
>> >
>> > Please be aware that we're not complaining about the lack of resources in general, I'm complaining about the lack of **stable, dedicated** resources. An example of the latter is that currently, even if no build is in Flink's queue and I submit a request to be the queue head in the PST morning, my build won't even start in 6-8+h. That is an absurd amount of waiting time.
>> >
>> > That is to say, if ASF INFRA decides to adopt a quota system and grants Flink five DEDICATED servers that run all the time only for Flink, that'll be PERFECT and can totally solve our problem now.
>> >
>> > I feel what's missing in the ASF INFRA's Travis resource pool is some level of build capacity SLAs and certainty"
>> >
>> >
>> > Again, I believe these two problems differ in nature: long build time vs. lack of dedicated build resources. That is to say, shortening build time may relieve the situation, and may not. I'm slightly negative on disabling IT cases for PRs, because we would be at risk of bugs in PRs that UTs don't catch, which may cost a lot more to fix if they slow down or even block others, but I am open to others' opinions on it.
>> >
>> > AFAICT from INFRA ticket [1], donating to ASF INFRA won't be feasible to solve our problem since INFRA's pool is fully shared and they have no control or finer insight over resource allocation to a specific Apache project. As mentioned in [1], Apache Arrow is moving away from the ASF INFRA Travis pool (they are actually surprised Flink hasn't planned to do so). I know that Spark is on its own build infra. If we all agree on funding our own build infra, I'd be glad to help investigate any potential options after releasing 1.9 since I'm super busy with 1.9 now.
>> >
>> > [1] https://issues.apache.org/jira/browse/INFRA-18533
>> >
>> > On Tue, Jul 2, 2019 at 4:46 AM Chesnay Schepler <[hidden email]> wrote:
>> >
>> >> As a short-term stopgap, since we can assume this issue to become much worse in the following days/weeks, we could disable IT cases in PRs and only run them on master.
>> >>
>> >> On 02/07/2019 12:03, Chesnay Schepler wrote:
>> >>> People really have to stop thinking that just because something works for us it is also a good solution.
>> >>> Also, please remember that our builds run for 2h from start to finish, and not the 14 _minutes_ it takes for zeppelin.
>> >>> We are dealing with an entirely different scale here, both in terms of build times and number of builds.
>> >>>
>> >>> In this very thread people have been complaining about long queue times for their builds. Surprise, other Apache projects have been suffering the very same thing due to us not controlling our build times.
While switching services (be it Jenkins, CircleCI or >> whatever) >> >>> will possibly work for us (and these options are actually >> attractive, >> >>> like CircleCI's proper support for build artifacts), it will >> also >> >>> result in us likely negatively affecting other projects in >> significant >> >>> ways. >> >>> >> >>> Sure, the Jenkins setup has a good user experience for us, at >> the cost >> >>> of blocking Jenkins workers for a _lot_ of time. Right now we >> have 25 >> >>> PR's in our queue; that's possibly 50h we'd consume of Jenkins >> >>> resources, and the European contributors haven't even really >> started yet. >> >>> >> >>> FYI, the latest INFRA response from INFRA-18533: >> >>> >> >>> "Our rough metrics shows that Flink used over 5800 hours of >> build time >> >>> last month. That is equal to EIGHT servers running 24/7 for >> the ENTIRE >> >>> MONTH. EIGHT. nonstop. >> >>> When we discovered this last night, we discussed it some and >> are going >> >>> to tune down Flink to allow only five executors maximum. We >> cannot >> >>> allow Flink to consume so much of a Foundation shared resource." >> >>> >> >>> So yes, we either >> >>> a) have to heavily reduce our CI usage or >> >>> b) fund our own, either maintaining it ourselves or donating >> to Apache. >> >>> >> >>> On 02/07/2019 05:11, Bowen Li wrote: >> >>>> By looking at the git history of the Jenkins script, its core >> part >> >>>> was finished in March 2017 (and only two minor update in >> 2017/2018), >> >>>> so it's been running for over two years now and feels like >> Zepplin >> >>>> community has been quite happy with it. @Jeff Zhang >> >>>> <mailto:[hidden email] <mailto:[hidden email]>> can you >> share your insights and user >> >>>> experience with the Jenkins+Travis approach? >> >>>> >> >>>> Things like: >> >>>> >> >>>> - has the approach completely solved the resource capacity >> problem >> >>>> for Zepplin community? is Zepplin community happy with the >> result? 
>> >>>> - is the whole configuration chain stable (e.g. uptime) enough? >> >>>> - how often do you need to maintain the Jenkins infra? how many >> >>>> people are usually involved in maintenance and bug-fixes? >> >>>> >> >>>> The downside of this approach seems mostly to be on the >> maintenance >> >>>> to me - maintain the script and Jenkins infra. >> >>>> >> >>>> ** Having Our Own Travis-CI.com Account ** >> >>>> >> >>>> Another alternative I've been thinking of is to have our own >> >>>> travis-ci.com <http://travis-ci.com> <http://travis-ci.com> >> account with paid dedicated >> >>>> resources. Note travis-ci.org <http://travis-ci.org> >> <http://travis-ci.org> is the free >> >>>> version and travis-ci.com <http://travis-ci.com> >> <http://travis-ci.com> is the commercial >> >>>> version. We currently use a shared resource pool managed by >> ASK INFRA >> >>>> team on travis-ci.org <http://travis-ci.org> >> <http://travis-ci.org>, but we have no control >> >>>> over it - we can't see how it's configured, how much >> resources are >> >>>> available, how resources are allocated among Apache projects, >> etc. >> >>>> The nice thing about having an account on travis-ci.com >> <http://travis-ci.com> >> >>>> <http://travis-ci.com> are: >> >>>> >> >>>> - relatively low cost with much better resource guarantee >> than what >> >>>> we currently have [1]: $249/month with 5 dedicated concurrency, >> >>>> $489/month with 10 concurrency >> >>>> - low maintenance work compared to using Jenkins >> >>>> - (potentially) no migration cost according to Travis's doc [2] >> >>>> (pending verification) >> >>>> - full control over the build capacity/configuration >> compared to >> >>>> using ASF INFRA's pool >> >>>> >> >>>> I'd be surprised if we as such a vibrant community cannot >> find and >> >>>> fund $249*12=$2988 a year in exchange for a much better >> developer >> >>>> experience and much higher productivity. 
>> >>>> >> >>>> [1] https://travis-ci.com/plans >> >>>> [2] >> >>>> >> >> >> https://docs.travis-ci.com/user/migrate/open-source-repository-migration >> >>>> On Sat, Jun 29, 2019 at 8:39 AM Chesnay Schepler >> <[hidden email] <mailto:[hidden email]> >> >>>> <mailto:[hidden email] <mailto:[hidden email]>>> wrote: >> >>>> >> >>>> So yes, the Jenkins job keeps pulling the state from >> Travis until it >> >>>> finishes. >> >>>> >> >>>> Note sure I'm comfortable with the idea of using Jenkins >> workers >> >>>> just to >> >>>> idle for a several hours. >> >>>> >> >>>> On 29/06/2019 14:56, Jeff Zhang wrote: >> >>>> > Here's what zeppelin community did, we make a python >> script to >> >>>> check the >> >>>> > build status of pull request. >> >>>> > Here's script: >> >>>> > >> https://github.com/apache/zeppelin/blob/master/travis_check.py >> >>>> > >> >>>> > And this is the script we used in Jenkins build job. >> >>>> > >> >>>> > if [ -f "travis_check.py" ]; then >> >>>> > git log -n 1 >> >>>> > STATUS=$(curl -s $BUILD_URL | grep -e "GitHub pull >> >>>> request.*from.*" | sed >> >>>> > 's/.*GitHub pull request <a >> >>>> > href=\"\(https[^"]*\).*from[^"]*.\(https[^"]*\).*/\1 >> \2/g') >> >>>> > AUTHOR=$(echo $STATUS | sed 's/.*[/]\(.*\)$/\1/g') >> >>>> > PR=$(echo $STATUS | awk '{print $1}' | sed >> >>>> 's/.*[/]\(.*\)$/\1/g') >> >>>> > #COMMIT=$(git log -n 1 | grep "^Merge:" | awk >> '{print $3}') >> >>>> > #if [ -z $COMMIT ]; then >> >>>> > # COMMIT=$(curl -s >> >>>> https://api.github.com/repos/apache/zeppelin/pulls/$PR >> >>>> > | grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | >> tr '\n' ' ' >> >>>> | sed >> >>>> > 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' | >> grep -v >> >>>> "apache:" | >> >>>> > sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g') >> >>>> > #fi >> >>>> > >> >>>> > # get commit hash from PR >> >>>> > COMMIT=$(curl -s >> >>>> https://api.github.com/repos/apache/zeppelin/pulls/$PR | >> >>>> > grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | tr >> 
'\n' ' ' >> >>>> | sed >> >>>> > 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' | >> grep -v >> >>>> "apache:" | >> >>>> > sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g') >> >>>> > sleep 30 # sleep few moment to wait travis starts >> the build >> >>>> > RET_CODE=0 >> >>>> > python ./travis_check.py ${AUTHOR} ${COMMIT} || >> RET_CODE=$? >> >>>> > if [ $RET_CODE -eq 2 ]; then # try with repository >> name when >> >>>> travis-ci is >> >>>> > not available in the account >> >>>> > RET_CODE=0 >> >>>> > AUTHOR=$(curl -s >> >>>> https://api.github.com/repos/apache/zeppelin/pulls/$PR >> >>>> > | grep '"full_name":' | grep -v "apache/zeppelin" | sed >> >>>> > 's/.*[:][^"]*["]\([^/]*\).*/\1/g') >> >>>> > python ./travis_check.py ${AUTHOR} ${COMMIT} || >> RET_CODE=$? >> >>>> > fi >> >>>> > >> >>>> > if [ $RET_CODE -eq 2 ]; then # fail with can't find >> build >> >>>> information in >> >>>> > the travis >> >>>> > set +x >> >>>> > echo >> "-----------------------------------------------------" >> >>>> > echo "Looks like travis-ci is not configured for >> your fork." >> >>>> > echo "Please setup by swich on 'zeppelin' >> repository at >> >>>> > https://travis-ci.org/profile and travis-ci." >> >>>> > echo "And then make sure 'Build branch updates' >> option is >> >>>> enabled in >> >>>> > the settings >> https://travis-ci.org/${AUTHOR}/zeppelin/settings >> <https://travis-ci.org/$%7BAUTHOR%7D/zeppelin/settings> >> >>>> <https://travis-ci.org/$%7BAUTHOR%7D/zeppelin/settings>." >> >>>> > echo "" >> >>>> > echo "To trigger CI after setup, you will need >> ammend your >> >>>> last commit >> >>>> > with" >> >>>> > echo "git commit --amend" >> >>>> > echo "git push your-remote HEAD --force" >> >>>> > echo "" >> >>>> > echo "See >> >>>> > >> >>>> >> >> >> http://zeppelin.apache.org/contribution/contributions.html#continuous-integration >> >>>> > ." 
>> >>>> > fi >> >>>> > >> >>>> > exit $RET_CODE >> >>>> > else >> >>>> > set +x >> >>>> > echo "travis_check.py does not exists" >> >>>> > exit 1 >> >>>> > fi >> >>>> > >> >>>> > Chesnay Schepler <[hidden email] >> <mailto:[hidden email]> >> >>>> <mailto:[hidden email] <mailto:[hidden email]>>> >> 于2019年6月29日周六 下午3:17写道: >> >>>> > >> >>>> >> Does this imply that a Jenkins job is active as long >> as the >> >>>> Travis build >> >>>> >> runs? >> >>>> >> >> >>>> >> On 26/06/2019 21:28, Bowen Li wrote: >> >>>> >>> Hi, >> >>>> >>> >> >>>> >>> @Dawid, I think the "long test running" as I >> mentioned in the >> >>>> first >> >>>> >> email, >> >>>> >>> also as you guys said, belongs to "a big effort >> which is much >> >>>> harder to >> >>>> >>> accomplish in a short period of time and may deserve >> its own >> >>>> separate >> >>>> >>> discussion". Thus I didn't include it in what we can >> do in a >> >>>> foreseeable >> >>>> >>> short term. >> >>>> >>> >> >>>> >>> Besides, I don't think that's the ultimate reason >> for lack of >> >>>> build >> >>>> >>> resources. Even if the build is shortened to >> something like >> >>>> 2h, the >> >>>> >>> problems of no build machine works about 6 or more >> hours in >> >>>> PST daytime >> >>>> >>> that I described will still happen, because no >> machine from >> >>>> ASF INFRA's >> >>>> >>> pool is allocated to Flink. As I have paid close >> attention to >> >>>> the build >> >>>> >>> queue in the past few weekdays, it's a pretty clear >> pattern now. >> >>>> >>> >> >>>> >>> **The ultimate root cause** for that is - we don't >> have any >> >>>> **dedicated** >> >>>> >>> build resources that we can stably rely on. I'm >> actually ok to >> >>>> wait for a >> >>>> >>> long time if there are build requests running, it >> means at >> >>>> least we are >> >>>> >>> making progress. But I'm not ok with no build >> resource. 
A >> >>>> better place I >> >>>> >>> think we should aim at in short term is to always >> have at >> >>>> least a central >> >>>> >>> pool (can be 3 or 5) of machines dedicated to build >> Flink at >> >>>> any time, or >> >>>> >>> maybe use users resources. >> >>>> >>> >> >>>> >>> @Chesnay @Robert I synced with Jeff offline that >> Zeppelin >> >>>> community is >> >>>> >>> using a Jenkins job to automatically build on users' >> travis >> >>>> account and >> >>>> >>> link the result back to github PR. I guess the >> Jenkins job >> >>>> would fetch >> >>>> >>> latest upstream master and build the PR against it. >> Jeff has >> >>>> filed >> >>>> >> tickets >> >>>> >>> to learn and get access to the Jenkins infra. It'll >> better to >> >>>> fully >> >>>> >>> understand it first before judging this approach. >> >>>> >>> >> >>>> >>> I also heard good things about CircleCI, and ASF >> INFRA seems >> >>>> to have a >> >>>> >> pool >> >>>> >>> of build capacity there too. Can be an alternative >> to consider. >> >>>> >>> >> >>>> >>> >> >>>> >>> >> >>>> >>> >> >>>> >>> >> >>>> >>> >> >>>> >>> >> >>>> >>> >> >>>> >>> >> >>>> >>> On Wed, Jun 26, 2019 at 12:44 AM Dawid Wysakowicz < >> >>>> >> [hidden email] >> <mailto:[hidden email]> <mailto:[hidden email] >> <mailto:[hidden email]>>> >> >>>> >>> wrote: >> >>>> >>> >> >>>> >>>> Sorry to jump in late, but I think Bowen missed the >> most >> >>>> important point >> >>>> >>>> from Chesnay's previous message in the summary. The >> ultimate >> >>>> reason for >> >>>> >>>> all the problems is that the tests take close to 2 >> hours to >> >>>> run already. >> >>>> >>>> I fully support this claim: "Unless people start >> caring about >> >>>> test times >> >>>> >>>> before adding them, this issue cannot be solved" >> >>>> >>>> >> >>>> >>>> This is also another reason why using user's Travis >> account >> >>>> won't help. >> >>>> >>>> Every few weeks we reach the user's time limit for >> a single >> >>>> profile. 
>> >>>> >>>> This makes the user's builds simply fail, until we >> either >> >>>> properly >> >>>> >>>> decrease the time the tests take (which I am not >> sure we ever >> >>>> did) or >> >>>> >>>> postpone the problem by splitting into more >> profiles. (Note >> >>>> that the ASF >> >>>> >>>> Travis account has higher time limits) >> >>>> >>>> >> >>>> >>>> Best, >> >>>> >>>> >> >>>> >>>> Dawid >> >>>> >>>> >> >>>> >>>> On 26/06/2019 09:36, Robert Metzger wrote: >> >>>> >>>>> Do we know if using "the best" available hardware >> would >> >>>> improve the >> >>>> >> build >> >>>> >>>>> times? >> >>>> >>>>> Imagine we would run the build on machines with >> plenty of >> >>>> main memory >> >>>> >> to >> >>>> >>>>> mount everything to ramdisk + the latest CPU >> architecture? >> >>>> >>>>> >> >>>> >>>>> Throwing hardware at the problem could help reduce >> the time >> >>>> of an >> >>>> >>>>> individual build, and using our own infrastructure >> would >> >>>> remove our >> >>>> >>>>> dependency on Apache's Travis account (with the >> obvious >> >>>> downside of >> >>>> >>>> having >> >>>> >>>>> to maintain the infrastructure) >> >>>> >>>>> We could use an open source travis alternative, to >> have a >> >>>> similar >> >>>> >>>>> experience and make the migration easy. >> >>>> >>>>> >> >>>> >>>>> >> >>>> >>>>> On Wed, Jun 26, 2019 at 9:34 AM Chesnay Schepler >> >>>> <[hidden email] <mailto:[hidden email]> >> <mailto:[hidden email] <mailto:[hidden email]>>> >> >>>> >>>> wrote: >> >>>> >>>>>> >From what I gathered, there's no special >> sauce that the >> >>>> Zeppelin >> >>>> >>>>>> project uses which actually integrates a users >> Travis >> >>>> account into the >> >>>> >>>> PR. >> >>>> >>>>>> They just disabled Travis for PRs. And that's >> kind of it. 
>> >>>> >>>>>> >> >>>> >>>>>> Naturally we can do this (duh) and safe the ASF a >> fair >> >>>> amount of >> >>>> >>>>>> resources, but there are downsides: >> >>>> >>>>>> >> >>>> >>>>>> The discoverability of the Travis check takes a >> nose-dive. >> >>>> Either we >> >>>> >>>>>> require every contributor to always, an every >> commit, also >> >>>> post a >> >>>> >> Travis >> >>>> >>>>>> build, or we have the reviewer sift through the >> >>>> contributors account >> >>>> >> to >> >>>> >>>>>> find it. >> >>>> >>>>>> >> >>>> >>>>>> This is rather cumbersome. Additionally, it's >> also not >> >>>> equivalent to >> >>>> >>>>>> having a PR build. >> >>>> >>>>>> >> >>>> >>>>>> A normal branch build takes a branch as is and >> tests it. A >> >>>> PR build >> >>>> >>>>>> merges the branch into master, and then runs it. >> (Fun fact: >> >>>> This is >> >>>> >> why >> >>>> >>>>>> a PR without merge conflicts is not being run on >> Travis.) >> >>>> >>>>>> >> >>>> >>>>>> And ultimately, everyone can already make use of >> this >> >>>> approach anyway. >> >>>> >>>>>> >> >>>> >>>>>> On 25/06/2019 08:02, Jark Wu wrote: >> >>>> >>>>>>> Hi Jeff, >> >>>> >>>>>>> >> >>>> >>>>>>> Thanks for sharing the Zeppelin approach. I >> think it's a >> >>>> good idea to >> >>>> >>>>>>> leverage user's travis account. >> >>>> >>>>>>> In this way, we can have almost unlimited >> concurrent build >> >>>> jobs and >> >>>> >>>>>>> developers can restart build by themselves >> (currently only >> >>>> committers >> >>>> >>>>>>> can restart PR's build). >> >>>> >>>>>>> >> >>>> >>>>>>> But I'm still not very clear how to integrate >> user's >> >>>> travis build >> >>>> >> into >> >>>> >>>>>>> the Flink pull request's build automatically. >> Can you >> >>>> explain more in >> >>>> >>>>>>> detail? >> >>>> >>>>>>> >> >>>> >>>>>>> Another question: does travis only build >> branches for user >> >>>> account? 
>> >>>> >>>>>>> My concern is that builds for PRs will rebase >> user's >> >>>> commits against >> >>>> >>>>>>> current master branch. >> >>>> >>>>>>> This will help us to find problems before >> merge. Builds >> >>>> for branches >> >>>> >>>>>>> will lose the impact of new commits in master. >> >>>> >>>>>>> How does Zeppelin solve this problem? >> >>>> >>>>>>> >> >>>> >>>>>>> Thanks again for sharing the idea. >> >>>> >>>>>>> >> >>>> >>>>>>> Regards, >> >>>> >>>>>>> Jark >> >>>> >>>>>>> >> >>>> >>>>>>> On Tue, 25 Jun 2019 at 11:01, Jeff Zhang >> <[hidden email] <mailto:[hidden email]> >> >>>> <mailto:[hidden email] <mailto:[hidden email]>> >> >>>> >>>>>>> <mailto:[hidden email] >> <mailto:[hidden email]> <mailto:[hidden email] >> <mailto:[hidden email]>>>> wrote: >> >>>> >>>>>>> >> >>>> >>>>>>> Hi Folks, >> >>>> >>>>>>> >> >>>> >>>>>>> Zeppelin meet this kind of issue before, we solve >> >>>> it by >> >>>> >> delegating >> >>>> >>>>>>> each >> >>>> >>>>>>> one's PR build to his travis account >> (Everyone can >> >>>> have 5 free >> >>>> >>>>>>> slot for >> >>>> >>>>>>> travis build). >> >>>> >>>>>>> Apache account travis build is only triggered >> when >> >>>> PR is merged. 
>> >>>> >>>>>>> >> >>>> >>>>>>> >> >>>> >>>>>>> >> >>>> >>>>>>> Kurt Young <[hidden email] >> <mailto:[hidden email]> >> >>>> <mailto:[hidden email] <mailto:[hidden email]>> >> <mailto:[hidden email] <mailto:[hidden email]> >> >>>> <mailto:[hidden email] <mailto:[hidden email]>>>> >> >>>> >>>>>>> 于2019年6月25日周二 上午10:16写道: >> >>>> >>>>>>> >> >>>> >>>>>>> > (Forgot to cc George) >> >>>> >>>>>>> > >> >>>> >>>>>>> > Best, >> >>>> >>>>>>> > Kurt >> >>>> >>>>>>> > >> >>>> >>>>>>> > >> >>>> >>>>>>> > On Tue, Jun 25, 2019 at 10:16 AM Kurt Young >> >>>> <[hidden email] <mailto:[hidden email]> >> <mailto:[hidden email] <mailto:[hidden email]>> >> >>>> >>>>>>> <mailto:[hidden email] >> <mailto:[hidden email]> <mailto:[hidden email] >> <mailto:[hidden email]>>>> >> >>>> wrote: >> >>>> >>>>>>> > >> >>>> >>>>>>> > > Hi Bowen, >> >>>> >>>>>>> > > >> >>>> >>>>>>> > > Thanks for bringing this up. We >> actually have >> >>>> discussed >> >>>> >> about >> >>>> >>>>>>> this, and I >> >>>> >>>>>>> > > think Till and George have >> >>>> >>>>>>> > > already spend sometime investigating >> it. I have >> >>>> cced both of >> >>>> >>>>>>> them, and >> >>>> >>>>>>> > > maybe they can share >> >>>> >>>>>>> > > their findings. >> >>>> >>>>>>> > > >> >>>> >>>>>>> > > Best, >> >>>> >>>>>>> > > Kurt >> >>>> >>>>>>> > > >> >>>> >>>>>>> > > >> >>>> >>>>>>> > > On Tue, Jun 25, 2019 at 10:08 AM Jark Wu >> >>>> <[hidden email] <mailto:[hidden email]> >> <mailto:[hidden email] <mailto:[hidden email]>> >> >>>> >>>>>>> <mailto:[hidden email] >> <mailto:[hidden email]> <mailto:[hidden email] >> <mailto:[hidden email]>>>> >> >>>> wrote: >> >>>> >>>>>>> > > >> >>>> >>>>>>> > >> Hi Bowen, >> >>>> >>>>>>> > >> >> >>>> >>>>>>> > >> Thanks for bringing this. We also >> suffered from >> >>>> the long >> >>>> >>>>>>> build time. >> >>>> >>>>>>> > >> I agree that we should focus on >> solving build >> >>>> capacity >> >>>> >>>>>>> problem in the >> >>>> >>>>>>> > >> thread. 
>> >>>> >>>>>>> > >> >> >>>> >>>>>>> > >> My observation is there is only one >> build is >> >>>> running, all >> >>>> >> the >> >>>> >>>>>>> others >> >>>> >>>>>>> > >> (other >> >>>> >>>>>>> > >> PRs, master) are pending. >> >>>> >>>>>>> > >> The pricing plan[1] of travis shows >> it can >> >>>> support >> >>>> >> concurrent >> >>>> >>>>>>> build >> >>>> >>>>>>> > jobs. >> >>>> >>>>>>> > >> But I don't know which plan we are >> using, might >> >>>> be the free >> >>>> >>>>>>> plan for >> >>>> >>>>>>> > open >> >>>> >>>>>>> > >> source. >> >>>> >>>>>>> > >> >> >>>> >>>>>>> > >> I cc-ed Chesnay who may have some >> experience on >> >>>> Travis. >> >>>> >>>>>>> > >> >> >>>> >>>>>>> > >> Regards, >> >>>> >>>>>>> > >> Jark >> >>>> >>>>>>> > >> >> >>>> >>>>>>> > >> [1]: https://travis-ci.com/plans >> >>>> >>>>>>> > >> >> >>>> >>>>>>> > >> On Tue, 25 Jun 2019 at 08:11, Bowen Li < >> >>>> >> [hidden email] <mailto:[hidden email]> >> <mailto:[hidden email] <mailto:[hidden email]>> >> >>>> >>>>>>> <mailto:[hidden email] >> <mailto:[hidden email]> >> >>>> <mailto:[hidden email] >> <mailto:[hidden email]>>>> wrote: >> >>>> >>>>>>> > >> >> >>>> >>>>>>> > >> > Hi Steven, >> >>>> >>>>>>> > >> > >> >>>> >>>>>>> > >> > I think you may not read what I >> wrote. The >> >>>> discussion is >> >>>> >>>> about >> >>>> >>>>>>> > "unstable >> >>>> >>>>>>> > >> > build **capacity**", in another word >> >>>> "unstable / lack of >> >>>> >>>> build >> >>>> >>>>>>> > >> resources", >> >>>> >>>>>>> > >> > not "unstable build". >> >>>> >>>>>>> > >> > >> >>>> >>>>>>> > >> > On Mon, Jun 24, 2019 at 4:40 PM >> Steven Wu >> >>>> >>>>>>> <[hidden email] >> <mailto:[hidden email]> <mailto:[hidden email] >> <mailto:[hidden email]>> >> >>>> <mailto:[hidden email] >> <mailto:[hidden email]> <mailto:[hidden email] >> <mailto:[hidden email]>>>> >> >>>> >>>>>>> > wrote: >> >>>> >>>>>>> > >> > >> >>>> >>>>>>> > >> > > long and sometimes unstable build is >> >>>> definitely a pain >> >>>> >>>>>> point. 
>> >>>> >>>>>>> > >> > > >> >>>> >>>>>>> > >> > > I suspect the build failure here in >> >>>> >> flink-connector-kafka >> >>>> >>>>>>> is not >> >>>> >>>>>>> > >> related >> >>>> >>>>>>> > >> > to >> >>>> >>>>>>> > >> > > my change. but there is no easy >> re-run the >> >>>> build on >> >>>> >>>>>>> travis UI. >> >>>> >>>>>>> > >> > > search showed a trick of >> close-and-open the >> >>>> PR will >> >>>> >>>>>>> trigger rebuild. >> >>>> >>>>>>> > >> but >> >>>> >>>>>>> > >> > > that could add noises to the PR >> activities. >> >>>> >>>>>>> > >> > > >> >>>> https://travis-ci.org/apache/flink/jobs/545555519 >> >>>> >>>>>>> > >> > > >> >>>> >>>>>>> > >> > > travis-ci for my personal repo >> often failed >> >>>> with >> >>>> >>>>>>> exceeding time >> >>>> >>>>>>> > limit >> >>>> >>>>>>> > >> > after >> >>>> >>>>>>> > >> > > 4+ hours. >> >>>> >>>>>>> > >> > > The job exceeded the maximum time >> limit for >> >>>> jobs, and >> >>>> >> has >> >>>> >>>>>>> been >> >>>> >>>>>>> > >> > terminated. >> >>>> >>>>>>> > >> > > >> >>>> >>>>>>> > >> > > On Mon, Jun 24, 2019 at 4:15 PM >> Bowen Li >> >>>> >>>>>>> <[hidden email] >> <mailto:[hidden email]> <mailto:[hidden email] >> <mailto:[hidden email]>> >> >>>> <mailto:[hidden email] <mailto:[hidden email]> >> <mailto:[hidden email] <mailto:[hidden email]>>>> >> >>>> >>>>>>> > wrote: >> >>>> >>>>>>> > >> > > >> >>>> >>>>>>> > >> > > > >> >>>> https://travis-ci.org/apache/flink/builds/549681530 >> >>>> >>>>>>> This build >> >>>> >>>>>>> > >> > request >> >>>> >>>>>>> > >> > > > has >> >>>> >>>>>>> > >> > > > been sitting at **HEAD of the >> queue** >> >>>> since I first >> >>>> >> saw >> >>>> >>>>>>> it at PST >> >>>> >>>>>>> > >> > 10:30am >> >>>> >>>>>>> > >> > > > (not sure how long it's been >> there before >> >>>> 10:30am). >> >>>> >>>>>>> It's PST >> >>>> >>>>>>> > 4:12pm >> >>>> >>>>>>> > >> now >> >>>> >>>>>>> > >> > > and >> >>>> >>>>>>> > >> > > > it hasn't started yet. 
>> >>>> >>>>>>> > >> > > > >> >>>> >>>>>>> > >> > > > On Mon, Jun 24, 2019 at 2:48 PM >> Bowen Li >> >>>> >>>>>>> <[hidden email] >> <mailto:[hidden email]> <mailto:[hidden email] >> <mailto:[hidden email]>> >> >>>> <mailto:[hidden email] <mailto:[hidden email]> >> <mailto:[hidden email] <mailto:[hidden email]>>>> >> >>>> >>>>>>> > >> wrote: >> >>>> >>>>>>> > >> > > > >> >>>> >>>>>>> > >> > > > > Hi devs, >> >>>> >>>>>>> > >> > > > > >> >>>> >>>>>>> > >> > > > > I've been experiencing the pain >> >>>> resulting from lack >> >>>> >>>>>>> of stable >> >>>> >>>>>>> > >> build >> >>>> >>>>>>> > >> > > > > capacity on Travis for Flink >> PRs [1]. >> >>>> >> Specifically, I >> >>>> >>>>>>> noticed >> >>>> >>>>>>> > >> often >> >>>> >>>>>>> > >> > > that >> >>>> >>>>>>> > >> > > > no >> >>>> >>>>>>> > >> > > > > build in the queue is making any >> >>>> progress for >> >>>> >> hours, >> >>>> >>>> and >> >>>> >>>>>>> > suddenly >> >>>> >>>>>>> > >> 5 >> >>>> >>>>>>> > >> > or >> >>>> >>>>>>> > >> > > 6 >> >>>> >>>>>>> > >> > > > > builds kick off all together >> after the >> >>>> long pause. >> >>>> >>>>>>> I'm at PST >> >>>> >>>>>>> > >> > (UTC-08) >> >>>> >>>>>>> > >> > > > time >> >>>> >>>>>>> > >> > > > > zone, and I've seen pause can >> be as >> >>>> long as 6 hours >> >>>> >>>>>>> from PST 9am >> >>>> >>>>>>> > >> to >> >>>> >>>>>>> > >> > 3pm >> >>>> >>>>>>> > >> > > > > (let alone the time needed to >> drain the >> >>>> queue >> >>>> >>>>>>> afterwards). >> >>>> >>>>>>> > >> > > > > >> >>>> >>>>>>> > >> > > > > I think this has greatly >> impacted our >> >>>> productivity. >> >>>> >>>> I've >> >>>> >>>>>>> > >> experienced >> >>>> >>>>>>> > >> > > that >> >>>> >>>>>>> > >> > > > > PRs submitted in the early >> morning of >> >>>> PST time zone >> >>>> >>>>>>> won't finish >> >>>> >>>>>>> > >> > their >> >>>> >>>>>>> > >> > > > > build until late night of the >> same day. 
>> >>>> >>>>>>> > >> > > > > >> >>>> >>>>>>> > >> > > > > So my questions are: >> >>>> >>>>>>> > >> > > > > >> >>>> >>>>>>> > >> > > > > - Has anyone else experienced >> the same >> >>>> problem or >> >>>> >>>>>>> have similar >> >>>> >>>>>>> > >> > > > observation >> >>>> >>>>>>> > >> > > > > on TravisCI? (I suspect it >> has things >> >>>> to do with >> >>>> >> time >> >>>> >>>>>>> zone) >> >>>> >>>>>>> > >> > > > > >> >>>> >>>>>>> > >> > > > > - What pricing plan of >> TravisCI is >> >>>> Flink currently >> >>>> >>>>>>> using? Is it >> >>>> >>>>>>> > >> the >> >>>> >>>>>>> > >> > > free >> >>>> >>>>>>> > >> > > > > plan for open source >> projects? What >> >>>> are the >> >>>> >>>>>>> guaranteed build >> >>>> >>>>>>> > >> capacity >> >>>> >>>>>>> > >> > > of >> >>>> >>>>>>> > >> > > > > the current plan? >> >>>> >>>>>>> > >> > > > > >> >>>> >>>>>>> > >> > > > > - If the current pricing plan >> (either >> >>>> free or paid) >> >>>> >>>>>> can't >> >>>> >>>>>>> > provide >> >>>> >>>>>>> > >> > > stable >> >>>> >>>>>>> > >> > > > > build capacity, can we >> upgrade to a >> >>>> higher priced >> >>>> >>>>>>> plan with >> >>>> >>>>>>> > larger >> >>>> >>>>>>> > >> > and >> >>>> >>>>>>> > >> > > > more >> >>>> >>>>>>> > >> > > > > stable build capacity? >> >>>> >>>>>>> > >> > > > > >> >>>> >>>>>>> > >> > > > > BTW, another factor that >> contribute to >> >>>> the >> >>>> >>>>>>> productivity problem >> >>>> >>>>>>> > is >> >>>> >>>>>>> > >> > that >> >>>> >>>>>>> > >> > > > > our build is slow - we run >> full build >> >>>> for every PR >> >>>> >>>> and a >> >>>> >>>>>>> > >> successful >> >>>> >>>>>>> > >> > > full >> >>>> >>>>>>> > >> > > > > build takes ~5h. We >> definitely have >> >>>> more options to >> >>>> >>>>>>> solve it, >> >>>> >>>>>>> > for >> >>>> >>>>>>> > >> > > > instance, >> >>>> >>>>>>> > >> > > > > modularize the build graphs >> and reuse >> >>>> artifacts >> >>>> >> from >> >>>> >>>> the >> >>>> >>>>>>> > previous >> >>>> >>>>>>> > >> > > build. 
>> >>>> >>>>>>> > >> > > > > But I think that can be a big >> effort >> >>>> which is much >> >>>> >>>>>>> harder to >> >>>> >>>>>>> > >> > accomplish >> >>>> >>>>>>> > >> > > > in >> >>>> >>>>>>> > >> > > > > a short period of time and >> may deserve >> >>>> its own >> >>>> >>>> separate >> >>>> >>>>>>> > >> discussion. >> >>>> >>>>>>> > >> > > > > >> >>>> >>>>>>> > >> > > > > [1] >> >>>> >> https://travis-ci.org/apache/flink/pull_requests >> >>>> >>>>>>> > >> > > > > >> >>>> >>>>>>> > >> > > > > >> >>>> >>>>>>> > >> > > > >> >>>> >>>>>>> > >> > > >> >>>> >>>>>>> > >> > >> >>>> >>>>>>> > >> >> >>>> >>>>>>> > > >> >>>> >>>>>>> > >> >>>> >>>>>>> >> >>>> >>>>>>> >> >>>> >>>>>>> -- >> >>>> >>>>>>> Best Regards >> >>>> >>>>>>> >> >>>> >>>>>>> Jeff Zhang >> >>>> >>>>>>> >> >>>> >> >> >>>> >> >>> >> >> >> > > |
In reply to this post by Kurt Young
+1. Thanks, Chesnay, for pushing this forward.

Best,
Haibo

At 2019-07-04 17:58:28, "Kurt Young" <[hidden email]> wrote:
> +1 and great thanks Chesnay for pushing this.
>
> Best,
> Kurt
>
> On Thu, Jul 4, 2019 at 5:44 PM Aljoscha Krettek <[hidden email]> wrote:
>
>> +1
>>
>> Aljoscha
>>
>> > On 4. Jul 2019, at 11:09, Stephan Ewen <[hidden email]> wrote:
>> >
>> > +1 to move to a private Travis account.
>> >
>> > I can confirm that Ververica will sponsor a Travis CI plan that is
>> > equivalent or a bit higher than the previous ASF quota (10 concurrent
>> > build queues)
>> >
>> > Best,
>> > Stephan
>> >
>> > On Thu, Jul 4, 2019 at 10:46 AM Chesnay Schepler <[hidden email]> wrote:
>> >
>> >> I've raised a JIRA <https://issues.apache.org/jira/browse/INFRA-18703>
>> >> with INFRA to inquire whether it would be possible to switch to a
>> >> different Travis account, and if so what steps would need to be taken.
>> >> We need a proper confirmation from INFRA since we are not in full
>> >> control of the flink repository (for example, we cannot access the
>> >> settings page).
>> >>
>> >> If this is indeed possible, Ververica is willing to sponsor a Travis
>> >> account for the Flink project.
>> >> This would provide us with more than enough resources.
>> >>
>> >> Since this makes the project more reliant on resources provided by
>> >> external companies I would like to vote on this.
>> >>
>> >> Please vote on this proposal, as follows:
>> >> [ ] +1, Approve the migration to a Ververica-sponsored Travis account,
>> >> provided that INFRA approves
>> >> [ ] -1, Do not approve the migration to a Ververica-sponsored Travis
>> >> account
>> >>
>> >> The vote will be open for at least 24h, and until we have confirmation
>> >> from INFRA. The voting period may be shorter than the usual 3 days since
>> >> our current setup is effectively not working.
>> >>
>> >> On 04/07/2019 06:51, Bowen Li wrote:
>> >>> Re: > Are they using their own Travis CI pool, or did they switch to an
>> >>> entirely different CI service?
>> >>>
>> >>> I reached out to Wes and Krisztián from Apache Arrow PMC. They are
>> >>> currently moving away from ASF's Travis to their own in-house metal
>> >>> machines at [1] with custom CI application at [2]. They've seen
>> >>> significant improvement w.r.t both much higher performance and
>> >>> basically no resource waiting time, "night-and-day" difference quoting
>> >>> Wes.
>> >>>
>> >>> Re: > If we can just switch to our own Travis pool, just for our
>> >>> project, then this might be something we can do fairly quickly?
>> >>>
>> >>> I believe so, according to [3] and [4]
>> >>>
>> >>> [1] https://ci.ursalabs.org/
>> >>> [2] https://github.com/ursa-labs/ursabot
>> >>> [3] https://docs.travis-ci.com/user/migrate/open-source-repository-migration
>> >>> [4] https://docs.travis-ci.com/user/migrate/open-source-on-travis-ci-com
>> >>>
>> >>> On Wed, Jul 3, 2019 at 12:01 AM Chesnay Schepler <[hidden email]> wrote:
>> >>>
>> >>>     Are they using their own Travis CI pool, or did they switch to an
>> >>>     entirely different CI service?
>> >>>
>> >>>     If we can just switch to our own Travis pool, just for our
>> >>>     project, then this might be something we can do fairly quickly?
>> >>>
>> >>> On 03/07/2019 05:55, Bowen Li wrote:
>> >>>> I responded in the INFRA ticket [1] that I believe they are using a wrong
>> >>>> metric against Flink and the total build time is a completely different
>> >>>> thing than guaranteed build capacity.
>> >>>>
>> >>>> My response:
>> >>>>
>> >>>> "As mentioned above, since I started to pay attention to Flink's build
>> >>>> queue a few tens of days ago, I'm in Seattle and I saw no build was kicking
>> >>>> off in PST daytime in weekdays for Flink.
>> >>>> Our teammates in China and Europe
>> >>>> have also reported similar observations. So we need to evaluate where the
>> >>>> large total build time came from - if 1) your number and 2) our
>> >>>> observations from three locations that cover pretty much a full day, are
>> >>>> all true, I **guess** one reason can be that - highly likely the extra
>> >>>> build time came from weekends when other Apache projects may be idle and
>> >>>> Flink just drains hard its congested queue.
>> >>>>
>> >>>> Please be aware that we're not complaining about the lack of resources
>> >>>> in general, I'm complaining about the lack of **stable, dedicated**
>> >>>> resources. An example for the latter one is, currently even if no build is
>> >>>> in Flink's queue and I submit a request to be the queue head in PST
>> >>>> morning, my build won't even start in 6-8+h. That is an absurd amount of
>> >>>> waiting time.
>> >>>>
>> >>>> That is to say, if ASF INFRA decides to adopt a quota system and grants
>> >>>> Flink five DEDICATED servers that run all the time only for Flink, that'll
>> >>>> be PERFECT and can totally solve our problem now.
>> >>>> >> >>>> I feel what's missing in the ASF INFRA's Travis resource pool is >> >>> some level >> >>>> of build capacity SLAs and certainty" >> >>>> >> >>>> >> >>>> Again, I believe there are differences in nature of these two >> >>> problems, >> >>>> long build time v.s. lack of dedicated build resource. That's >> >>> saying, >> >>>> shortening build time may relieve the situation, and may not. >> >>> I'm sightly >> >>>> negative on disabling IT cases for PRs, due to the downside is >> >>> that we are >> >>>> at risk of any potential bugs in PR that UTs doesn't catch, and >> >>> may cost a >> >>>> lot more to fix and if it slows others down or even block >> >>> others, but am >> >>>> open to others opinions on it. >> >>>> >> >>>> AFAICT from INFRA ticket[1], donating to ASF INFRA won't be >> >>> feasible to >> >>>> solve our problem since INFRA's pool is fully shared and they >> >>> have no >> >>>> control and finer insights over resource allocation to a >> >>> specific Apache >> >>>> project. As mentioned in [1], Apache Arrow is moving away from >> >>> ASF INFRA >> >>>> Travis pool (they are actually surprised Flink hasn't plan to do >> >>> so). I >> >>>> know that Spark is on its own build infra. If we all agree that >> >>> funding our >> >>>> own build infra, I'd be glad to help investigate any potential >> >>> options >> >>>> after releasing 1.9 since I'm super busy with 1.9 now. >> >>>> >> >>>> [1] https://issues.apache.org/jira/browse/INFRA-18533 >> >>>> >> >>>> >> >>>> >> >>>> On Tue, Jul 2, 2019 at 4:46 AM Chesnay Schepler >> >>> <[hidden email] <mailto:[hidden email]>> wrote: >> >>>> >> >>>>> As a short-term stopgap, since we can assume this issue to >> >>> become much >> >>>>> worse in the following days/weeks, we could disable IT cases in >> >>> PRs and >> >>>>> only run them on master. 
>> >>>>> >> >>>>> On 02/07/2019 12:03, Chesnay Schepler wrote: >> >>>>>> People really have to stop thinking that just because >> >>> something works >> >>>>>> for us it is also a good solution. >> >>>>>> Also, please remember that our builds run for 2h from start to >> >>> finish, >> >>>>>> and not the 14 _minutes_ it takes for zeppelin. >> >>>>>> We are dealing with an entirely different scale here, both in >> >>> terms of >> >>>>>> build times and number of builds. >> >>>>>> >> >>>>>> In this very thread people have been complaining about long queue >> >>>>>> times for their builds. Surprise, other Apache projects have been >> >>>>>> suffering the very same thing due to us not controlling our build >> >>>>>> times. While switching services (be it Jenkins, CircleCI or >> >>> whatever) >> >>>>>> will possibly work for us (and these options are actually >> >>> attractive, >> >>>>>> like CircleCI's proper support for build artifacts), it will also >> >>>>>> result in us likely negatively affecting other projects in >> >>> significant >> >>>>>> ways. >> >>>>>> >> >>>>>> Sure, the Jenkins setup has a good user experience for us, at >> >>> the cost >> >>>>>> of blocking Jenkins workers for a _lot_ of time. Right now we >> >>> have 25 >> >>>>>> PR's in our queue; that's possibly 50h we'd consume of Jenkins >> >>>>>> resources, and the European contributors haven't even really >> >>> started yet. >> >>>>>> >> >>>>>> FYI, the latest INFRA response from INFRA-18533: >> >>>>>> >> >>>>>> "Our rough metrics shows that Flink used over 5800 hours of >> >>> build time >> >>>>>> last month. That is equal to EIGHT servers running 24/7 for >> >>> the ENTIRE >> >>>>>> MONTH. EIGHT. nonstop. >> >>>>>> When we discovered this last night, we discussed it some and >> >>> are going >> >>>>>> to tune down Flink to allow only five executors maximum. We >> >> cannot >> >>>>>> allow Flink to consume so much of a Foundation shared resource." 
>> >>>>>> >> >>>>>> So yes, we either >> >>>>>> a) have to heavily reduce our CI usage or >> >>>>>> b) fund our own, either maintaining it ourselves or donating >> >>> to Apache. >> >>>>>> >> >>>>>> On 02/07/2019 05:11, Bowen Li wrote: >> >>>>>>> By looking at the git history of the Jenkins script, its core >> >>> part >> >>>>>>> was finished in March 2017 (and only two minor update in >> >>> 2017/2018), >> >>>>>>> so it's been running for over two years now and feels like >> >>> Zepplin >> >>>>>>> community has been quite happy with it. @Jeff Zhang >> >>>>>>> <mailto:[hidden email] <mailto:[hidden email]>> can you >> >>> share your insights and user >> >>>>>>> experience with the Jenkins+Travis approach? >> >>>>>>> >> >>>>>>> Things like: >> >>>>>>> >> >>>>>>> - has the approach completely solved the resource capacity >> >>> problem >> >>>>>>> for Zepplin community? is Zepplin community happy with the >> >>> result? >> >>>>>>> - is the whole configuration chain stable (e.g. uptime) enough? >> >>>>>>> - how often do you need to maintain the Jenkins infra? how many >> >>>>>>> people are usually involved in maintenance and bug-fixes? >> >>>>>>> >> >>>>>>> The downside of this approach seems mostly to be on the >> >>> maintenance >> >>>>>>> to me - maintain the script and Jenkins infra. >> >>>>>>> >> >>>>>>> ** Having Our Own Travis-CI.com Account ** >> >>>>>>> >> >>>>>>> Another alternative I've been thinking of is to have our own >> >>>>>>> travis-ci.com <http://travis-ci.com> <http://travis-ci.com> >> >>> account with paid dedicated >> >>>>>>> resources. Note travis-ci.org <http://travis-ci.org> >> >>> <http://travis-ci.org> is the free >> >>>>>>> version and travis-ci.com <http://travis-ci.com> >> >>> <http://travis-ci.com> is the commercial >> >>>>>>> version. 
We currently use a shared resource pool managed by >> >>> ASK INFRA >> >>>>>>> team on travis-ci.org <http://travis-ci.org> >> >>> <http://travis-ci.org>, but we have no control >> >>>>>>> over it - we can't see how it's configured, how much >> >>> resources are >> >>>>>>> available, how resources are allocated among Apache projects, >> >>> etc. >> >>>>>>> The nice thing about having an account on travis-ci.com >> >>> <http://travis-ci.com> >> >>>>>>> <http://travis-ci.com> are: >> >>>>>>> >> >>>>>>> - relatively low cost with much better resource guarantee >> >>> than what >> >>>>>>> we currently have [1]: $249/month with 5 dedicated concurrency, >> >>>>>>> $489/month with 10 concurrency >> >>>>>>> - low maintenance work compared to using Jenkins >> >>>>>>> - (potentially) no migration cost according to Travis's doc [2] >> >>>>>>> (pending verification) >> >>>>>>> - full control over the build capacity/configuration compared to >> >>>>>>> using ASF INFRA's pool >> >>>>>>> >> >>>>>>> I'd be surprised if we as such a vibrant community cannot >> >>> find and >> >>>>>>> fund $249*12=$2988 a year in exchange for a much better >> >> developer >> >>>>>>> experience and much higher productivity. >> >>>>>>> >> >>>>>>> [1] https://travis-ci.com/plans >> >>>>>>> [2] >> >>>>>>> >> >>>>> >> >>> >> >> >> https://docs.travis-ci.com/user/migrate/open-source-repository-migration >> >>>>>>> On Sat, Jun 29, 2019 at 8:39 AM Chesnay Schepler >> >>> <[hidden email] <mailto:[hidden email]> >> >>>>>>> <mailto:[hidden email] <mailto:[hidden email]>>> wrote: >> >>>>>>> >> >>>>>>> So yes, the Jenkins job keeps pulling the state from >> >>> Travis until it >> >>>>>>> finishes. >> >>>>>>> >> >>>>>>> Note sure I'm comfortable with the idea of using Jenkins >> >>> workers >> >>>>>>> just to >> >>>>>>> idle for a several hours. 
>> >>>>>>> >> >>>>>>> On 29/06/2019 14:56, Jeff Zhang wrote: >> >>>>>>>> Here's what zeppelin community did, we make a python >> >>> script to >> >>>>>>> check the >> >>>>>>>> build status of pull request. >> >>>>>>>> Here's script: >> >>>>>>>> >> >>> https://github.com/apache/zeppelin/blob/master/travis_check.py >> >>>>>>>> >> >>>>>>>> And this is the script we used in Jenkins build job. >> >>>>>>>> >> >>>>>>>> if [ -f "travis_check.py" ]; then >> >>>>>>>> git log -n 1 >> >>>>>>>> STATUS=$(curl -s $BUILD_URL | grep -e "GitHub pull >> >>>>>>> request.*from.*" | sed >> >>>>>>>> 's/.*GitHub pull request <a >> >>>>>>>> href=\"\(https[^"]*\).*from[^"]*.\(https[^"]*\).*/\1 >> >>> \2/g') >> >>>>>>>> AUTHOR=$(echo $STATUS | sed 's/.*[/]\(.*\)$/\1/g') >> >>>>>>>> PR=$(echo $STATUS | awk '{print $1}' | sed >> >>>>>>> 's/.*[/]\(.*\)$/\1/g') >> >>>>>>>> #COMMIT=$(git log -n 1 | grep "^Merge:" | awk >> >>> '{print $3}') >> >>>>>>>> #if [ -z $COMMIT ]; then >> >>>>>>>> # COMMIT=$(curl -s >> >>>>>>> https://api.github.com/repos/apache/zeppelin/pulls/$PR >> >>>>>>>> | grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | >> >>> tr '\n' ' ' >> >>>>>>> | sed >> >>>>>>>> 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' | >> >>> grep -v >> >>>>>>> "apache:" | >> >>>>>>>> sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g') >> >>>>>>>> #fi >> >>>>>>>> >> >>>>>>>> # get commit hash from PR >> >>>>>>>> COMMIT=$(curl -s >> >>>>>>> https://api.github.com/repos/apache/zeppelin/pulls/$PR | >> >>>>>>>> grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | tr >> >>> '\n' ' ' >> >>>>>>> | sed >> >>>>>>>> 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' | >> >>> grep -v >> >>>>>>> "apache:" | >> >>>>>>>> sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g') >> >>>>>>>> sleep 30 # sleep few moment to wait travis starts >> >>> the build >> >>>>>>>> RET_CODE=0 >> >>>>>>>> python ./travis_check.py ${AUTHOR} ${COMMIT} || >> >>> RET_CODE=$? 
>> >>>>>>>> if [ $RET_CODE -eq 2 ]; then # try with repository >> >>> name when >> >>>>>>> travis-ci is >> >>>>>>>> not available in the account >> >>>>>>>> RET_CODE=0 >> >>>>>>>> AUTHOR=$(curl -s >> >>>>>>> https://api.github.com/repos/apache/zeppelin/pulls/$PR >> >>>>>>>> | grep '"full_name":' | grep -v "apache/zeppelin" | sed >> >>>>>>>> 's/.*[:][^"]*["]\([^/]*\).*/\1/g') >> >>>>>>>> python ./travis_check.py ${AUTHOR} ${COMMIT} || >> >>> RET_CODE=$? >> >>>>>>>> fi >> >>>>>>>> >> >>>>>>>> if [ $RET_CODE -eq 2 ]; then # fail with can't find >> >>> build >> >>>>>>> information in >> >>>>>>>> the travis >> >>>>>>>> set +x >> >>>>>>>> echo >> >>> "-----------------------------------------------------" >> >>>>>>>> echo "Looks like travis-ci is not configured for >> >>> your fork." >> >>>>>>>> echo "Please setup by swich on 'zeppelin' >> >>> repository at >> >>>>>>>> https://travis-ci.org/profile and travis-ci." >> >>>>>>>> echo "And then make sure 'Build branch updates' >> >>> option is >> >>>>>>> enabled in >> >>>>>>>> the settings >> >>> https://travis-ci.org/${AUTHOR}/zeppelin/settings >> >>> <https://travis-ci.org/$%7BAUTHOR%7D/zeppelin/settings> >> >>>>>>> <https://travis-ci.org/$%7BAUTHOR%7D/zeppelin/settings>." >> >>>>>>>> echo "" >> >>>>>>>> echo "To trigger CI after setup, you will need >> >>> ammend your >> >>>>>>> last commit >> >>>>>>>> with" >> >>>>>>>> echo "git commit --amend" >> >>>>>>>> echo "git push your-remote HEAD --force" >> >>>>>>>> echo "" >> >>>>>>>> echo "See >> >>>>>>>> >> >>>>>>> >> >>>>> >> >>> >> >> >> http://zeppelin.apache.org/contribution/contributions.html#continuous-integration >> >>>>>>>> ." 
>> >>>>>>>> fi >> >>>>>>>> >> >>>>>>>> exit $RET_CODE >> >>>>>>>> else >> >>>>>>>> set +x >> >>>>>>>> echo "travis_check.py does not exists" >> >>>>>>>> exit 1 >> >>>>>>>> fi >> >>>>>>>> >> >>>>>>>> Chesnay Schepler <[hidden email] >> >>> <mailto:[hidden email]> >> >>>>>>> <mailto:[hidden email] <mailto:[hidden email]>>> >> >>> wrote on Sat, Jun 29, 2019 at 3:17 PM: >> >>>>>>>> >> >>>>>>>>> Does this imply that a Jenkins job is active as long >> >>> as the >> >>>>>>> Travis build >> >>>>>>>>> runs? >> >>>>>>>>> >> >>>>>>>>> On 26/06/2019 21:28, Bowen Li wrote: >> >>>>>>>>>> Hi, >> >>>>>>>>>> >> >>>>>>>>>> @Dawid, I think the "long test running" as I >> >>> mentioned in the >> >>>>>>> first >> >>>>>>>>> email, >> >>>>>>>>>> also as you guys said, belongs to "a big effort >> >>> which is much >> >>>>>>> harder to >> >>>>>>>>>> accomplish in a short period of time and may deserve >> >>> its own >> >>>>>>> separate >> >>>>>>>>>> discussion". Thus I didn't include it in what we can >> >>> do in a >> >>>>>>> foreseeable >> >>>>>>>>>> short term. >> >>>>>>>>>> >> >>>>>>>>>> Besides, I don't think that's the ultimate reason >> >>> for lack of >> >>>>>>> build >> >>>>>>>>>> resources. Even if the build is shortened to >> >>> something like >> >>>>>>> 2h, the >> >>>>>>>>>> problems of no build machine works about 6 or more >> >>> hours in >> >>>>>>> PST daytime >> >>>>>>>>>> that I described will still happen, because no >> >>> machine from >> >>>>>>> ASF INFRA's >> >>>>>>>>>> pool is allocated to Flink. As I have paid close >> >>> attention to >> >>>>>>> the build >> >>>>>>>>>> queue in the past few weekdays, it's a pretty clear >> >>> pattern now. >> >>>>>>>>>> >> >>>>>>>>>> **The ultimate root cause** for that is - we don't >> >>> have any >> >>>>>>> **dedicated** >> >>>>>>>>>> build resources that we can stably rely on.
I'm >> >>> actually ok to >> >>>>>>> wait for a >> >>>>>>>>>> long time if there are build requests running, it >> >>> means at >> >>>>>>> least we are >> >>>>>>>>>> making progress. But I'm not ok with no build >> >>> resource. A >> >>>>>>> better place I >> >>>>>>>>>> think we should aim at in short term is to always >> >>> have at >> >>>>>>> least a central >> >>>>>>>>>> pool (can be 3 or 5) of machines dedicated to build >> >>> Flink at >> >>>>>>> any time, or >> >>>>>>>>>> maybe use users resources. >> >>>>>>>>>> >> >>>>>>>>>> @Chesnay @Robert I synced with Jeff offline that >> >>> Zeppelin >> >>>>>>> community is >> >>>>>>>>>> using a Jenkins job to automatically build on users' >> >>> travis >> >>>>>>> account and >> >>>>>>>>>> link the result back to github PR. I guess the >> >>> Jenkins job >> >>>>>>> would fetch >> >>>>>>>>>> latest upstream master and build the PR against it. >> >>> Jeff has >> >>>>>>> filed >> >>>>>>>>> tickets >> >>>>>>>>>> to learn and get access to the Jenkins infra. It'll >> >>> better to >> >>>>>>> fully >> >>>>>>>>>> understand it first before judging this approach. >> >>>>>>>>>> >> >>>>>>>>>> I also heard good things about CircleCI, and ASF >> >>> INFRA seems >> >>>>>>> to have a >> >>>>>>>>> pool >> >>>>>>>>>> of build capacity there too. Can be an alternative >> >>> to consider. >> >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> On Wed, Jun 26, 2019 at 12:44 AM Dawid Wysakowicz < >> >>>>>>>>> [hidden email] >> >>> <mailto:[hidden email]> <mailto:[hidden email] >> >>> <mailto:[hidden email]>>> >> >>>>>>>>>> wrote: >> >>>>>>>>>> >> >>>>>>>>>>> Sorry to jump in late, but I think Bowen missed the >> >>> most >> >>>>>>> important point >> >>>>>>>>>>> from Chesnay's previous message in the summary. The >> >>> ultimate >> >>>>>>> reason for >> >>>>>>>>>>> all the problems is that the tests take close to 2 >> >>> hours to >> >>>>>>> run already. 
>> >>>>>>>>>>> I fully support this claim: "Unless people start >> >>> caring about >> >>>>>>> test times >> >>>>>>>>>>> before adding them, this issue cannot be solved" >> >>>>>>>>>>> >> >>>>>>>>>>> This is also another reason why using user's Travis >> >>> account >> >>>>>>> won't help. >> >>>>>>>>>>> Every few weeks we reach the user's time limit for >> >>> a single >> >>>>>>> profile. >> >>>>>>>>>>> This makes the user's builds simply fail, until we >> >>> either >> >>>>>>> properly >> >>>>>>>>>>> decrease the time the tests take (which I am not >> >>> sure we ever >> >>>>>>> did) or >> >>>>>>>>>>> postpone the problem by splitting into more >> >>> profiles. (Note >> >>>>>>> that the ASF >> >>>>>>>>>>> Travis account has higher time limits) >> >>>>>>>>>>> >> >>>>>>>>>>> Best, >> >>>>>>>>>>> >> >>>>>>>>>>> Dawid >> >>>>>>>>>>> >> >>>>>>>>>>> On 26/06/2019 09:36, Robert Metzger wrote: >> >>>>>>>>>>>> Do we know if using "the best" available hardware >> >>> would >> >>>>>>> improve the >> >>>>>>>>> build >> >>>>>>>>>>>> times? >> >>>>>>>>>>>> Imagine we would run the build on machines with >> >>> plenty of >> >>>>>>> main memory >> >>>>>>>>> to >> >>>>>>>>>>>> mount everything to ramdisk + the latest CPU >> >>> architecture? >> >>>>>>>>>>>> >> >>>>>>>>>>>> Throwing hardware at the problem could help reduce >> >>> the time >> >>>>>>> of an >> >>>>>>>>>>>> individual build, and using our own infrastructure >> >>> would >> >>>>>>> remove our >> >>>>>>>>>>>> dependency on Apache's Travis account (with the >> >>> obvious >> >>>>>>> downside of >> >>>>>>>>>>> having >> >>>>>>>>>>>> to maintain the infrastructure) >> >>>>>>>>>>>> We could use an open source travis alternative, to >> >>> have a >> >>>>>>> similar >> >>>>>>>>>>>> experience and make the migration easy. 
>> >>>>>>>>>>>> >> >>>>>>>>>>>> >> >>>>>>>>>>>> On Wed, Jun 26, 2019 at 9:34 AM Chesnay Schepler >> >>>>>>> <[hidden email] <mailto:[hidden email]> >> >>> <mailto:[hidden email] <mailto:[hidden email]>>> >> >>>>>>>>>>> wrote: >> >>>>>>>>>>>>>> From what I gathered, there's no special >> >>> sauce that the >> >>>>>>> Zeppelin >> >>>>>>>>>>>>> project uses which actually integrates a users >> >> Travis >> >>>>>>> account into the >> >>>>>>>>>>> PR. >> >>>>>>>>>>>>> They just disabled Travis for PRs. And that's >> >>> kind of it. >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> Naturally we can do this (duh) and safe the ASF a >> >>> fair >> >>>>>>> amount of >> >>>>>>>>>>>>> resources, but there are downsides: >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> The discoverability of the Travis check takes a >> >>> nose-dive. >> >>>>>>> Either we >> >>>>>>>>>>>>> require every contributor to always, an every >> >>> commit, also >> >>>>>>> post a >> >>>>>>>>> Travis >> >>>>>>>>>>>>> build, or we have the reviewer sift through the >> >>>>>>> contributors account >> >>>>>>>>> to >> >>>>>>>>>>>>> find it. >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> This is rather cumbersome. Additionally, it's >> >>> also not >> >>>>>>> equivalent to >> >>>>>>>>>>>>> having a PR build. >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> A normal branch build takes a branch as is and >> >>> tests it. A >> >>>>>>> PR build >> >>>>>>>>>>>>> merges the branch into master, and then runs it. >> >>> (Fun fact: >> >>>>>>> This is >> >>>>>>>>> why >> >>>>>>>>>>>>> a PR without merge conflicts is not being run on >> >>> Travis.) >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> And ultimately, everyone can already make use of >> >> this >> >>>>>>> approach anyway. >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> On 25/06/2019 08:02, Jark Wu wrote: >> >>>>>>>>>>>>>> Hi Jeff, >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> Thanks for sharing the Zeppelin approach. I >> >>> think it's a >> >>>>>>> good idea to >> >>>>>>>>>>>>>> leverage user's travis account. 
>> >>>>>>>>>>>>>> In this way, we can have almost unlimited >> >>> concurrent build >> >>>>>>> jobs and >> >>>>>>>>>>>>>> developers can restart build by themselves >> >>> (currently only >> >>>>>>> committers >> >>>>>>>>>>>>>> can restart PR's build). >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> But I'm still not very clear how to integrate >> >> user's >> >>>>>>> travis build >> >>>>>>>>> into >> >>>>>>>>>>>>>> the Flink pull request's build automatically. >> >>> Can you >> >>>>>>> explain more in >> >>>>>>>>>>>>>> detail? >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> Another question: does travis only build >> >>> branches for user >> >>>>>>> account? >> >>>>>>>>>>>>>> My concern is that builds for PRs will rebase >> >> user's >> >>>>>>> commits against >> >>>>>>>>>>>>>> current master branch. >> >>>>>>>>>>>>>> This will help us to find problems before >> >>> merge. Builds >> >>>>>>> for branches >> >>>>>>>>>>>>>> will lose the impact of new commits in master. >> >>>>>>>>>>>>>> How does Zeppelin solve this problem? >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> Thanks again for sharing the idea. >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> Regards, >> >>>>>>>>>>>>>> Jark >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> On Tue, 25 Jun 2019 at 11:01, Jeff Zhang >> >>> <[hidden email] <mailto:[hidden email]> >> >>>>>>> <mailto:[hidden email] <mailto:[hidden email]>> >> >>>>>>>>>>>>>> <mailto:[hidden email] >> >>> <mailto:[hidden email]> <mailto:[hidden email] >> >>> <mailto:[hidden email]>>>> wrote: >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> Hi Folks, >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> Zeppelin meet this kind of issue before, we solve >> >>>>>>> it by >> >>>>>>>>> delegating >> >>>>>>>>>>>>>> each >> >>>>>>>>>>>>>> one's PR build to his travis account >> >>> (Everyone can >> >>>>>>> have 5 free >> >>>>>>>>>>>>>> slot for >> >>>>>>>>>>>>>> travis build). >> >>>>>>>>>>>>>> Apache account travis build is only triggered when >> >>>>>>> PR is merged. 
>> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> Kurt Young <[hidden email] >> >>> <mailto:[hidden email]> >> >>>>>>> <mailto:[hidden email] <mailto:[hidden email]>> >> >>> <mailto:[hidden email] <mailto:[hidden email]> >> >>>>>>> <mailto:[hidden email] <mailto:[hidden email]>>>> >> >>>>>>>>>>>>>> wrote on Tue, Jun 25, 2019 at 10:16 AM: >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> (Forgot to cc George) >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> Best, >> >>>>>>>>>>>>>>> Kurt >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> On Tue, Jun 25, 2019 at 10:16 AM Kurt Young >> >>>>>>> <[hidden email] <mailto:[hidden email]> >> >>> <mailto:[hidden email] <mailto:[hidden email]>> >> >>>>>>>>>>>>>> <mailto:[hidden email] >> >>> <mailto:[hidden email]> <mailto:[hidden email] >> >>> <mailto:[hidden email]>>>> >> >>>>>>> wrote: >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>> Hi Bowen, >> >>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>> Thanks for bringing this up. We >> >>> actually have >> >>>>>>> discussed >> >>>>>>>>> about >> >>>>>>>>>>>>>> this, and I >> >>>>>>>>>>>>>>>> think Till and George have >> >>>>>>>>>>>>>>>> already spent some time investigating >> >>> it. I have >> >>>>>>> cced both of >> >>>>>>>>>>>>>> them, and >> >>>>>>>>>>>>>>>> maybe they can share >> >>>>>>>>>>>>>>>> their findings. >> >>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>> Best, >> >>>>>>>>>>>>>>>> Kurt >> >>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>> On Tue, Jun 25, 2019 at 10:08 AM Jark Wu >> >>>>>>> <[hidden email] <mailto:[hidden email]> >> >>> <mailto:[hidden email] <mailto:[hidden email]>> >> >>>>>>>>>>>>>> <mailto:[hidden email] >> >>> <mailto:[hidden email]> <mailto:[hidden email] >> >>> <mailto:[hidden email]>>>> >> >>>>>>> wrote: >> >>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> Hi Bowen, >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> Thanks for bringing this. We also >> >>> suffered from >> >>>>>>> the long >> >>>>>>>>>>>>>> build time.
>> >>>>>>>>>>>>>>>>> I agree that we should focus on >> >>> solving build >> >>>>>>> capacity >> >>>>>>>>>>>>>> problem in the >> >>>>>>>>>>>>>>>>> thread. >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> My observation is there is only one >> >>> build is >> >>>>>>> running, all >> >>>>>>>>> the >> >>>>>>>>>>>>>> others >> >>>>>>>>>>>>>>>>> (other >> >>>>>>>>>>>>>>>>> PRs, master) are pending. >> >>>>>>>>>>>>>>>>> The pricing plan[1] of travis shows >> >>> it can >> >>>>>>> support >> >>>>>>>>> concurrent >> >>>>>>>>>>>>>> build >> >>>>>>>>>>>>>>> jobs. >> >>>>>>>>>>>>>>>>> But I don't know which plan we are >> >>> using, might >> >>>>>>> be the free >> >>>>>>>>>>>>>> plan for >> >>>>>>>>>>>>>>> open >> >>>>>>>>>>>>>>>>> source. >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> I cc-ed Chesnay who may have some >> >>> experience on >> >>>>>>> Travis. >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> Regards, >> >>>>>>>>>>>>>>>>> Jark >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> [1]: https://travis-ci.com/plans >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> On Tue, 25 Jun 2019 at 08:11, Bowen Li < >> >>>>>>>>> [hidden email] <mailto:[hidden email]> >> >>> <mailto:[hidden email] <mailto:[hidden email]>> >> >>>>>>>>>>>>>> <mailto:[hidden email] >> >>> <mailto:[hidden email]> >> >>>>>>> <mailto:[hidden email] >> >>> <mailto:[hidden email]>>>> wrote: >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> Hi Steven, >> >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> I think you may not read what I >> >>> wrote. The >> >>>>>>> discussion is >> >>>>>>>>>>> about >> >>>>>>>>>>>>>>> "unstable >> >>>>>>>>>>>>>>>>>> build **capacity**", in another word >> >>>>>>> "unstable / lack of >> >>>>>>>>>>> build >> >>>>>>>>>>>>>>>>> resources", >> >>>>>>>>>>>>>>>>>> not "unstable build". 
>> >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> On Mon, Jun 24, 2019 at 4:40 PM >> >>> Steven Wu >> >>>>>>>>>>>>>> <[hidden email] >> >>> <mailto:[hidden email]> <mailto:[hidden email] >> >>> <mailto:[hidden email]>> >> >>>>>>> <mailto:[hidden email] >> >>> <mailto:[hidden email]> <mailto:[hidden email] >> >>> <mailto:[hidden email]>>>> >> >>>>>>>>>>>>>>> wrote: >> >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>> long and sometimes unstable build is >> >>>>>>> definitely a pain >> >>>>>>>>>>>>> point. >> >>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>> I suspect the build failure here in >> >>>>>>>>> flink-connector-kafka >> >>>>>>>>>>>>>> is not >> >>>>>>>>>>>>>>>>> related >> >>>>>>>>>>>>>>>>>> to >> >>>>>>>>>>>>>>>>>>> my change. but there is no easy >> >>> re-run the >> >>>>>>> build on >> >>>>>>>>>>>>>> travis UI. >> >>>>>>>>>>>>>>>>>>> search showed a trick of >> >>> close-and-open the >> >>>>>>> PR will >> >>>>>>>>>>>>>> trigger rebuild. >> >>>>>>>>>>>>>>>>> but >> >>>>>>>>>>>>>>>>>>> that could add noises to the PR >> >>> activities. >> >>>>>>>>>>>>>>>>>>> >> >>>>>>> https://travis-ci.org/apache/flink/jobs/545555519 >> >>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>> travis-ci for my personal repo >> >>> often failed >> >>>>>>> with >> >>>>>>>>>>>>>> exceeding time >> >>>>>>>>>>>>>>> limit >> >>>>>>>>>>>>>>>>>> after >> >>>>>>>>>>>>>>>>>>> 4+ hours. >> >>>>>>>>>>>>>>>>>>> The job exceeded the maximum time >> >>> limit for >> >>>>>>> jobs, and >> >>>>>>>>> has >> >>>>>>>>>>>>>> been >> >>>>>>>>>>>>>>>>>> terminated. 
>> >>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>> On Mon, Jun 24, 2019 at 4:15 PM >> >>> Bowen Li >> >>>>>>>>>>>>>> <[hidden email] >> >>> <mailto:[hidden email]> <mailto:[hidden email] >> >>> <mailto:[hidden email]>> >> >>>>>>> <mailto:[hidden email] <mailto:[hidden email]> >> >>> <mailto:[hidden email] <mailto:[hidden email]>>>> >> >>>>>>>>>>>>>>> wrote: >> >>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>>> >> >>>>>>> https://travis-ci.org/apache/flink/builds/549681530 >> >>>>>>>>>>>>>> This build >> >>>>>>>>>>>>>>>>>> request >> >>>>>>>>>>>>>>>>>>>> has >> >>>>>>>>>>>>>>>>>>>> been sitting at **HEAD of the >> >>> queue** >> >>>>>>> since I first >> >>>>>>>>> saw >> >>>>>>>>>>>>>> it at PST >> >>>>>>>>>>>>>>>>>> 10:30am >> >>>>>>>>>>>>>>>>>>>> (not sure how long it's been >> >>> there before >> >>>>>>> 10:30am). >> >>>>>>>>>>>>>> It's PST >> >>>>>>>>>>>>>>> 4:12pm >> >>>>>>>>>>>>>>>>> now >> >>>>>>>>>>>>>>>>>>> and >> >>>>>>>>>>>>>>>>>>>> it hasn't started yet. >> >>>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>>> On Mon, Jun 24, 2019 at 2:48 PM >> >>> Bowen Li >> >>>>>>>>>>>>>> <[hidden email] >> >>> <mailto:[hidden email]> <mailto:[hidden email] >> >>> <mailto:[hidden email]>> >> >>>>>>> <mailto:[hidden email] <mailto:[hidden email]> >> >>> <mailto:[hidden email] <mailto:[hidden email]>>>> >> >>>>>>>>>>>>>>>>> wrote: >> >>>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>>>> Hi devs, >> >>>>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>>>> I've been experiencing the pain >> >>>>>>> resulting from lack >> >>>>>>>>>>>>>> of stable >> >>>>>>>>>>>>>>>>> build >> >>>>>>>>>>>>>>>>>>>>> capacity on Travis for Flink >> >>> PRs [1]. 
>> >>>>>>>>> Specifically, I >> >>>>>>>>>>>>>> noticed >> >>>>>>>>>>>>>>>>> often >> >>>>>>>>>>>>>>>>>>> that >> >>>>>>>>>>>>>>>>>>>> no >> >>>>>>>>>>>>>>>>>>>>> build in the queue is making any >> >>>>>>> progress for >> >>>>>>>>> hours, >> >>>>>>>>>>> and >> >>>>>>>>>>>>>>> suddenly >> >>>>>>>>>>>>>>>>> 5 >> >>>>>>>>>>>>>>>>>> or >> >>>>>>>>>>>>>>>>>>> 6 >> >>>>>>>>>>>>>>>>>>>>> builds kick off all together >> >>> after the >> >>>>>>> long pause. >> >>>>>>>>>>>>>> I'm at PST >> >>>>>>>>>>>>>>>>>> (UTC-08) >> >>>>>>>>>>>>>>>>>>>> time >> >>>>>>>>>>>>>>>>>>>>> zone, and I've seen pause can >> >>> be as >> >>>>>>> long as 6 hours >> >>>>>>>>>>>>>> from PST 9am >> >>>>>>>>>>>>>>>>> to >> >>>>>>>>>>>>>>>>>> 3pm >> >>>>>>>>>>>>>>>>>>>>> (let alone the time needed to >> >>> drain the >> >>>>>>> queue >> >>>>>>>>>>>>>> afterwards). >> >>>>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>>>> I think this has greatly >> >>> impacted our >> >>>>>>> productivity. >> >>>>>>>>>>> I've >> >>>>>>>>>>>>>>>>> experienced >> >>>>>>>>>>>>>>>>>>> that >> >>>>>>>>>>>>>>>>>>>>> PRs submitted in the early >> >>> morning of >> >>>>>>> PST time zone >> >>>>>>>>>>>>>> won't finish >> >>>>>>>>>>>>>>>>>> their >> >>>>>>>>>>>>>>>>>>>>> build until late night of the >> >>> same day. >> >>>>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>>>> So my questions are: >> >>>>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>>>> - Has anyone else experienced >> >>> the same >> >>>>>>> problem or >> >>>>>>>>>>>>>> have similar >> >>>>>>>>>>>>>>>>>>>> observation >> >>>>>>>>>>>>>>>>>>>>> on TravisCI? (I suspect it >> >>> has things >> >>>>>>> to do with >> >>>>>>>>> time >> >>>>>>>>>>>>>> zone) >> >>>>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>>>> - What pricing plan of >> >>> TravisCI is >> >>>>>>> Flink currently >> >>>>>>>>>>>>>> using? Is it >> >>>>>>>>>>>>>>>>> the >> >>>>>>>>>>>>>>>>>>> free >> >>>>>>>>>>>>>>>>>>>>> plan for open source >> >>> projects? 
What >> >>>>>>> are the >> >>>>>>>>>>>>>> guaranteed build >> >>>>>>>>>>>>>>>>> capacity >> >>>>>>>>>>>>>>>>>>> of >> >>>>>>>>>>>>>>>>>>>>> the current plan? >> >>>>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>>>> - If the current pricing plan >> >>> (either >> >>>>>>> free or paid) >> >>>>>>>>>>>>> can't >> >>>>>>>>>>>>>>> provide >> >>>>>>>>>>>>>>>>>>> stable >> >>>>>>>>>>>>>>>>>>>>> build capacity, can we >> >>> upgrade to a >> >>>>>>> higher priced >> >>>>>>>>>>>>>> plan with >> >>>>>>>>>>>>>>> larger >> >>>>>>>>>>>>>>>>>> and >> >>>>>>>>>>>>>>>>>>>> more >> >>>>>>>>>>>>>>>>>>>>> stable build capacity? >> >>>>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>>>> BTW, another factor that >> >>> contribute to >> >>>>>>> the >> >>>>>>>>>>>>>> productivity problem >> >>>>>>>>>>>>>>> is >> >>>>>>>>>>>>>>>>>> that >> >>>>>>>>>>>>>>>>>>>>> our build is slow - we run >> >>> full build >> >>>>>>> for every PR >> >>>>>>>>>>> and a >> >>>>>>>>>>>>>>>>> successful >> >>>>>>>>>>>>>>>>>>> full >> >>>>>>>>>>>>>>>>>>>>> build takes ~5h. We >> >>> definitely have >> >>>>>>> more options to >> >>>>>>>>>>>>>> solve it, >> >>>>>>>>>>>>>>> for >> >>>>>>>>>>>>>>>>>>>> instance, >> >>>>>>>>>>>>>>>>>>>>> modularize the build graphs >> >>> and reuse >> >>>>>>> artifacts >> >>>>>>>>> from >> >>>>>>>>>>> the >> >>>>>>>>>>>>>>> previous >> >>>>>>>>>>>>>>>>>>> build. >> >>>>>>>>>>>>>>>>>>>>> But I think that can be a big >> >>> effort >> >>>>>>> which is much >> >>>>>>>>>>>>>> harder to >> >>>>>>>>>>>>>>>>>> accomplish >> >>>>>>>>>>>>>>>>>>>> in >> >>>>>>>>>>>>>>>>>>>>> a short period of time and >> >>> may deserve >> >>>>>>> its own >> >>>>>>>>>>> separate >> >>>>>>>>>>>>>>>>> discussion. 
>> >>>>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>>>> [1] >> >>>>>>>>> https://travis-ci.org/apache/flink/pull_requests >> >>>>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>> >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> -- >> >>>>>>>>>>>>>> Best Regards >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> Jeff Zhang >> >>>>>>>>>>>>>> >> >>>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>> >> >> >> >> >> >> |
+1 and thanks for Chesnay's work on this.
Best,
Zhijiang

------------------------------------------------------------------
From: Haibo Sun <[hidden email]>
Send Time: 2019-07-04 (Thursday) 18:21
To: dev <[hidden email]>
Cc: [hidden email] <[hidden email]>
Subject: Re:Re: [VOTE] Migrate to sponsored Travis account

+1. Thanks to Chesnay for pushing this forward.

Best,
Haibo

At 2019-07-04 17:58:28, "Kurt Young" <[hidden email]> wrote:
>+1 and great thanks to Chesnay for pushing this.
>
>Best,
>Kurt
>
>
>On Thu, Jul 4, 2019 at 5:44 PM Aljoscha Krettek <[hidden email]> wrote:
>
>> +1
>>
>> Aljoscha
>>
>> > On 4. Jul 2019, at 11:09, Stephan Ewen <[hidden email]> wrote:
>> >
>> > +1 to move to a private Travis account.
>> >
>> > I can confirm that Ververica will sponsor a Travis CI plan that is
>> > equivalent to or a bit higher than the previous ASF quota (10 concurrent
>> > build queues)
>> >
>> > Best,
>> > Stephan
>> >
>> > On Thu, Jul 4, 2019 at 10:46 AM Chesnay Schepler <[hidden email]> wrote:
>> >
>> >> I've raised a JIRA
>> >> <https://issues.apache.org/jira/browse/INFRA-18703> with INFRA to inquire
>> >> whether it would be possible to switch to a different Travis account,
>> >> and if so what steps would need to be taken.
>> >> We need a proper confirmation from INFRA since we are not in full
>> >> control of the flink repository (for example, we cannot access the
>> >> settings page).
>> >>
>> >> If this is indeed possible, Ververica is willing to sponsor a Travis
>> >> account for the Flink project.
>> >> This would provide us with more than enough resources for our needs.
>> >>
>> >> Since this makes the project more reliant on resources provided by
>> >> external companies I would like to vote on this.
>> >>
>> >> Please vote on this proposal, as follows:
>> >> [ ] +1, Approve the migration to a Ververica-sponsored Travis account,
>> >> provided that INFRA approves
>> >> [ ] -1, Do not approve the migration to a Ververica-sponsored Travis
>> >> account
>> >>
>> >> The vote will be open for at least 24h, and until we have confirmation
>> >> from INFRA. The voting period may be shorter than the usual 3 days since
>> >> our current setup is effectively not working.
>> >>
>> >> On 04/07/2019 06:51, Bowen Li wrote:
>> >>> Re: > Are they using their own Travis CI pool, or did they switch to an
>> >>> entirely different CI service?
>> >>>
>> >>> I reached out to Wes and Krisztián from the Apache Arrow PMC. They are
>> >>> currently moving away from ASF's Travis to their own in-house metal
>> >>> machines at [1] with a custom CI application at [2]. They've seen
>> >>> significant improvement w.r.t. both much higher performance and
>> >>> basically no resource waiting time, a "night-and-day" difference, quoting
>> >>> Wes.
>> >>>
>> >>> Re: > If we can just switch to our own Travis pool, just for our
>> >>> project, then this might be something we can do fairly quickly?
>> >>>
>> >>> I believe so, according to [3] and [4]
>> >>>
>> >>>
>> >>> [1] https://ci.ursalabs.org/
>> >>> [2] https://github.com/ursa-labs/ursabot
>> >>> [3] https://docs.travis-ci.com/user/migrate/open-source-repository-migration
>> >>> [4] https://docs.travis-ci.com/user/migrate/open-source-on-travis-ci-com
>> >>>
>> >>>
>> >>>
>> >>> On Wed, Jul 3, 2019 at 12:01 AM Chesnay Schepler <[hidden email]> wrote:
>> >>>
>> >>>     Are they using their own Travis CI pool, or did they switch to an
>> >>>     entirely different CI service?
>> >>>
>> >>>     If we can just switch to our own Travis pool, just for our
>> >>>     project, then this might be something we can do fairly quickly?
>> >>>
>> >>> On 03/07/2019 05:55, Bowen Li wrote:
>> >>>> I responded in the INFRA ticket [1] that I believe they are using a
>> >>>> wrong metric against Flink and the total build time is a completely
>> >>>> different thing than guaranteed build capacity.
>> >>>>
>> >>>> My response:
>> >>>>
>> >>>> "As mentioned above, since I started to pay attention to Flink's build
>> >>>> queue a few tens of days ago, I'm in Seattle and I saw no build was
>> >>>> kicking off in PST daytime in weekdays for Flink. Our teammates in China
>> >>>> and Europe have also reported similar observations. So we need to
>> >>>> evaluate where the large total build time came from - if 1) your number
>> >>>> and 2) our observations from three locations that cover pretty much a
>> >>>> full day, are all true, I **guess** one reason can be that - highly
>> >>>> likely the extra build time came from weekends when other Apache
>> >>>> projects may be idle and Flink just drains hard its congested queue.
>> >>>>
>> >>>> Please be aware that we're not complaining about the lack of resources
>> >>>> in general, I'm complaining about the lack of **stable, dedicated**
>> >>>> resources. An example for the latter one is, currently even if no build
>> >>>> is in Flink's queue and I submit a request to be the queue head in PST
>> >>>> morning, my build won't even start in 6-8+h. That is an absurd amount of
>> >>>> waiting time.
>> >>>>
>> >>>> That's saying, if ASF INFRA decides to adopt a quota system and grants
>> >>>> Flink five DEDICATED servers that run all the time only for Flink,
>> >>>> that'll be PERFECT and can totally solve our problem now.
>> >>>>
>> >>>> I feel what's missing in the ASF INFRA's Travis resource pool is some
>> >>>> level of build capacity SLAs and certainty"
>> >>>>
>> >>>>
>> >>>> Again, I believe there are differences in nature of these two problems,
>> >>>> long build time vs. lack of dedicated build resources. That's saying,
>> >>>> shortening build time may relieve the situation, and may not. I'm
>> >>>> slightly negative on disabling IT cases for PRs, because the downside is
>> >>>> that we are at risk of any potential bugs in PRs that UTs don't catch,
>> >>>> which may cost a lot more to fix if they slow others down or even block
>> >>>> others, but am open to others' opinions on it.
>> >>>>
>> >>>> AFAICT from INFRA ticket [1], donating to ASF INFRA won't be feasible to
>> >>>> solve our problem since INFRA's pool is fully shared and they have no
>> >>>> control and finer insights over resource allocation to a specific Apache
>> >>>> project. As mentioned in [1], Apache Arrow is moving away from the ASF
>> >>>> INFRA Travis pool (they are actually surprised Flink hasn't planned to
>> >>>> do so). I know that Spark is on its own build infra. If we all agree on
>> >>>> funding our own build infra, I'd be glad to help investigate any
>> >>>> potential options after releasing 1.9 since I'm super busy with 1.9 now.
>> >>>>
>> >>>> [1] https://issues.apache.org/jira/browse/INFRA-18533
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Tue, Jul 2, 2019 at 4:46 AM Chesnay Schepler <[hidden email]> wrote:
>> >>>>
>> >>>>> As a short-term stopgap, since we can assume this issue to become much
>> >>>>> worse in the following days/weeks, we could disable IT cases in PRs and
>> >>>>> only run them on master.
>> >>>>>
>> >>>>> On 02/07/2019 12:03, Chesnay Schepler wrote:
>> >>>>>> People really have to stop thinking that just because something works
>> >>>>>> for us it is also a good solution.
>> >>>>>> Also, please remember that our builds run for 2h from start to finish,
>> >>>>>> and not the 14 _minutes_ it takes for zeppelin.
>> >>>>>> We are dealing with an entirely different scale here, both in terms of
>> >>>>>> build times and number of builds.
>> >>>>>>
>> >>>>>> In this very thread people have been complaining about long queue
>> >>>>>> times for their builds. Surprise, other Apache projects have been
>> >>>>>> suffering the very same thing due to us not controlling our build
>> >>>>>> times. While switching services (be it Jenkins, CircleCI or whatever)
>> >>>>>> will possibly work for us (and these options are actually attractive,
>> >>>>>> like CircleCI's proper support for build artifacts), it will also
>> >>>>>> result in us likely negatively affecting other projects in significant
>> >>>>>> ways.
>> >>>>>>
>> >>>>>> Sure, the Jenkins setup has a good user experience for us, at the cost
>> >>>>>> of blocking Jenkins workers for a _lot_ of time. Right now we have 25
>> >>>>>> PR's in our queue; that's possibly 50h we'd consume of Jenkins
>> >>>>>> resources, and the European contributors haven't even really started yet.
>> >>>>>>
>> >>>>>> FYI, the latest INFRA response from INFRA-18533:
>> >>>>>>
>> >>>>>> "Our rough metrics shows that Flink used over 5800 hours of build time
>> >>>>>> last month. That is equal to EIGHT servers running 24/7 for the ENTIRE
>> >>>>>> MONTH. EIGHT. nonstop.
>> >>>>>> When we discovered this last night, we discussed it some and are going
>> >>>>>> to tune down Flink to allow only five executors maximum. We cannot
>> >>>>>> allow Flink to consume so much of a Foundation shared resource."
>> >>>>>>
>> >>>>>> So yes, we either
>> >>>>>> a) have to heavily reduce our CI usage or
>> >>>>>> b) fund our own, either maintaining it ourselves or donating to Apache.
>> >>>>>>
>> >>>>>> On 02/07/2019 05:11, Bowen Li wrote:
>> >>>>>>> By looking at the git history of the Jenkins script, its core part
>> >>>>>>> was finished in March 2017 (and only two minor updates in 2017/2018),
>> >>>>>>> so it's been running for over two years now and it feels like the
>> >>>>>>> Zeppelin community has been quite happy with it. @Jeff Zhang
>> >>>>>>> <[hidden email]> can you share your insights and user
>> >>>>>>> experience with the Jenkins+Travis approach?
>> >>>>>>>
>> >>>>>>> Things like:
>> >>>>>>>
>> >>>>>>> - has the approach completely solved the resource capacity problem
>> >>>>>>> for the Zeppelin community? is the Zeppelin community happy with the
>> >>>>>>> result?
>> >>>>>>> - is the whole configuration chain stable (e.g. uptime) enough?
>> >>>>>>> - how often do you need to maintain the Jenkins infra? how many
>> >>>>>>> people are usually involved in maintenance and bug-fixes?
>> >>>>>>>
>> >>>>>>> The downside of this approach seems mostly to be on the maintenance
>> >>>>>>> to me - maintain the script and Jenkins infra.
+1. Thanks Chesnay and Bowen for pushing this forward.
Regards,
Dian

> On July 4, 2019, at 6:28 PM, zhijiang <[hidden email]> wrote:
>
> +1 and thanks for Chesnay's work on this.
>
> Best,
> Zhijiang
>
> ------------------------------------------------------------------
> From: Haibo Sun <[hidden email]>
> Send Time: Thursday, July 4, 2019, 18:21
> To: dev <[hidden email]>
> Cc: [hidden email] <[hidden email]>
> Subject: Re:Re: [VOTE] Migrate to sponsored Travis account
>
> +1. Thanks, Chesnay, for pushing this forward.
>
> Best,
> Haibo
>
> At 2019-07-04 17:58:28, "Kurt Young" <[hidden email]> wrote:
>> +1 and great thanks to Chesnay for pushing this.
>>
>> Best,
>> Kurt
>>
>> On Thu, Jul 4, 2019 at 5:44 PM Aljoscha Krettek <[hidden email]> wrote:
>>
>>> +1
>>>
>>> Aljoscha
>>>
>>>> On 4. Jul 2019, at 11:09, Stephan Ewen <[hidden email]> wrote:
>>>>
>>>> +1 to move to a private Travis account.
>>>>
>>>> I can confirm that Ververica will sponsor a Travis CI plan that is
>>>> equivalent to or a bit higher than the previous ASF quota (10
>>>> concurrent build queues).
>>>>
>>>> Best,
>>>> Stephan
>>>>
>>>> On Thu, Jul 4, 2019 at 10:46 AM Chesnay Schepler <[hidden email]> wrote:
>>>>
>>>>> I've raised a JIRA
>>>>> <https://issues.apache.org/jira/browse/INFRA-18703> with INFRA to
>>>>> inquire whether it would be possible to switch to a different Travis
>>>>> account, and if so what steps would need to be taken.
>>>>> We need a proper confirmation from INFRA since we are not in full
>>>>> control of the flink repository (for example, we cannot access the
>>>>> settings page).
>>>>>
>>>>> If this is indeed possible, Ververica is willing to sponsor a Travis
>>>>> account for the Flink project.
>>>>> This would provide us with more than enough resources.
>>>>>
>>>>> Since this makes the project more reliant on resources provided by
>>>>> external companies I would like to vote on this.
>>>>>
>>>>> Please vote on this proposal, as follows:
>>>>> [ ] +1, Approve the migration to a Ververica-sponsored Travis account,
>>>>> provided that INFRA approves
>>>>> [ ] -1, Do not approve the migration to a Ververica-sponsored Travis
>>>>> account
>>>>>
>>>>> The vote will be open for at least 24h, and until we have confirmation
>>>>> from INFRA. The voting period may be shorter than the usual 3 days since
>>>>> our current setup is effectively not working.
>>>>>
>>>>> On 04/07/2019 06:51, Bowen Li wrote:
>>>>>> Re: > Are they using their own Travis CI pool, or did they switch to an
>>>>>> entirely different CI service?
>>>>>>
>>>>>> I reached out to Wes and Krisztián from the Apache Arrow PMC. They are
>>>>>> currently moving away from ASF's Travis to their own in-house metal
>>>>>> machines at [1] with a custom CI application at [2]. They've seen
>>>>>> significant improvement, with both much higher performance and
>>>>>> basically no resource waiting time: a "night-and-day" difference,
>>>>>> quoting Wes.
>>>>>>
>>>>>> Re: > If we can just switch to our own Travis pool, just for our
>>>>>> project, then this might be something we can do fairly quickly?
>>>>>>
>>>>>> I believe so, according to [3] and [4].
>>>>>>
>>>>>> [1] https://ci.ursalabs.org/
>>>>>> [2] https://github.com/ursa-labs/ursabot
>>>>>> [3] https://docs.travis-ci.com/user/migrate/open-source-repository-migration
>>>>>> [4] https://docs.travis-ci.com/user/migrate/open-source-on-travis-ci-com
>>>>>>
>>>>>> On Wed, Jul 3, 2019 at 12:01 AM Chesnay Schepler <[hidden email]> wrote:
>>>>>>
>>>>>>     Are they using their own Travis CI pool, or did they switch to an
>>>>>>     entirely different CI service?
>>>>>>
>>>>>>     If we can just switch to our own Travis pool, just for our
>>>>>>     project, then this might be something we can do fairly quickly?
>>>>>> >>>>>> On 03/07/2019 05:55, Bowen Li wrote: >>>>>>> I responded in the INFRA ticket [1] that I believe they are >>>>>> using a wrong >>>>>>> metric against Flink and the total build time is a completely >>>>>> different >>>>>>> thing than guaranteed build capacity. >>>>>>> >>>>>>> My response: >>>>>>> >>>>>>> "As mentioned above, since I started to pay attention to Flink's >>>>>> build >>>>>>> queue a few tens of days ago, I'm in Seattle and I saw no build >>>>>> was kicking >>>>>>> off in PST daytime in weekdays for Flink. Our teammates in China >>>>>> and Europe >>>>>>> have also reported similar observations. So we need to evaluate >>>>>> how the >>>>>>> large total build time came from - if 1) your number and 2) our >>>>>>> observations from three locations that cover pretty much a full >>>>>> day, are >>>>>>> all true, I **guess** one reason can be that - highly likely the >>>>>> extra >>>>>>> build time came from weekends when other Apache projects may be >>>>>> idle and >>>>>>> Flink just drains hard its congested queue. >>>>>>> >>>>>>> Please be aware of that we're not complaining about the lack of >>>>>> resources >>>>>>> in general, I'm complaining about the lack of **stable, dedicated** >>>>>>> resources. An example for the latter one is, currently even if >>>>>> no build is >>>>>>> in Flink's queue and I submit a request to be the queue head in PST >>>>>>> morning, my build won't even start in 6-8+h. That is an absurd >>>>>> amount of >>>>>>> waiting time. >>>>>>> >>>>>>> That's saying, if ASF INFRA decides to adopt a quota system and >>>>>> grants >>>>>>> Flink five DEDICATED servers that runs all the time only for >>>>>> Flink, that'll >>>>>>> be PERFECT and can totally solve our problem now. >>>>>>> >>>>>>> Please be aware of that we're not complaining about the lack of >>>>>> resources >>>>>>> in general, I'm complaining about the lack of **stable, dedicated** >>>>>>> resources. 
An example for the latter one is, currently even if >>>>>> no build is >>>>>>> in Flink's queue and I submit a request to be the queue head in PST >>>>>>> morning, my build won't even start in 6-8+h. That is an absurd >>>>>> amount of >>>>>>> waiting time. >>>>>>> >>>>>>> >>>>>>> That's saying, if ASF INFRA decides to adopt a quota system and >>>>>> grants >>>>>>> Flink five DEDICATED servers that runs all the time only for >>>>>> Flink, that'll >>>>>>> be PERFECT and can totally solve our problem now. >>>>>>> >>>>>>> I feel what's missing in the ASF INFRA's Travis resource pool is >>>>>> some level >>>>>>> of build capacity SLAs and certainty" >>>>>>> >>>>>>> >>>>>>> Again, I believe there are differences in nature of these two >>>>>> problems, >>>>>>> long build time v.s. lack of dedicated build resource. That's >>>>>> saying, >>>>>>> shortening build time may relieve the situation, and may not. >>>>>> I'm sightly >>>>>>> negative on disabling IT cases for PRs, due to the downside is >>>>>> that we are >>>>>>> at risk of any potential bugs in PR that UTs doesn't catch, and >>>>>> may cost a >>>>>>> lot more to fix and if it slows others down or even block >>>>>> others, but am >>>>>>> open to others opinions on it. >>>>>>> >>>>>>> AFAICT from INFRA ticket[1], donating to ASF INFRA won't be >>>>>> feasible to >>>>>>> solve our problem since INFRA's pool is fully shared and they >>>>>> have no >>>>>>> control and finer insights over resource allocation to a >>>>>> specific Apache >>>>>>> project. As mentioned in [1], Apache Arrow is moving away from >>>>>> ASF INFRA >>>>>>> Travis pool (they are actually surprised Flink hasn't plan to do >>>>>> so). I >>>>>>> know that Spark is on its own build infra. If we all agree that >>>>>> funding our >>>>>>> own build infra, I'd be glad to help investigate any potential >>>>>> options >>>>>>> after releasing 1.9 since I'm super busy with 1.9 now. 
>>>>>>> >>>>>>> [1] https://issues.apache.org/jira/browse/INFRA-18533 >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Tue, Jul 2, 2019 at 4:46 AM Chesnay Schepler >>>>>> <[hidden email] <mailto:[hidden email]>> wrote: >>>>>>> >>>>>>>> As a short-term stopgap, since we can assume this issue to >>>>>> become much >>>>>>>> worse in the following days/weeks, we could disable IT cases in >>>>>> PRs and >>>>>>>> only run them on master. >>>>>>>> >>>>>>>> On 02/07/2019 12:03, Chesnay Schepler wrote: >>>>>>>>> People really have to stop thinking that just because >>>>>> something works >>>>>>>>> for us it is also a good solution. >>>>>>>>> Also, please remember that our builds run for 2h from start to >>>>>> finish, >>>>>>>>> and not the 14 _minutes_ it takes for zeppelin. >>>>>>>>> We are dealing with an entirely different scale here, both in >>>>>> terms of >>>>>>>>> build times and number of builds. >>>>>>>>> >>>>>>>>> In this very thread people have been complaining about long queue >>>>>>>>> times for their builds. Surprise, other Apache projects have been >>>>>>>>> suffering the very same thing due to us not controlling our build >>>>>>>>> times. While switching services (be it Jenkins, CircleCI or >>>>>> whatever) >>>>>>>>> will possibly work for us (and these options are actually >>>>>> attractive, >>>>>>>>> like CircleCI's proper support for build artifacts), it will also >>>>>>>>> result in us likely negatively affecting other projects in >>>>>> significant >>>>>>>>> ways. >>>>>>>>> >>>>>>>>> Sure, the Jenkins setup has a good user experience for us, at >>>>>> the cost >>>>>>>>> of blocking Jenkins workers for a _lot_ of time. Right now we >>>>>> have 25 >>>>>>>>> PR's in our queue; that's possibly 50h we'd consume of Jenkins >>>>>>>>> resources, and the European contributors haven't even really >>>>>> started yet. 
>>>>>>>>> >>>>>>>>> FYI, the latest INFRA response from INFRA-18533: >>>>>>>>> >>>>>>>>> "Our rough metrics shows that Flink used over 5800 hours of >>>>>> build time >>>>>>>>> last month. That is equal to EIGHT servers running 24/7 for >>>>>> the ENTIRE >>>>>>>>> MONTH. EIGHT. nonstop. >>>>>>>>> When we discovered this last night, we discussed it some and >>>>>> are going >>>>>>>>> to tune down Flink to allow only five executors maximum. We >>>>> cannot >>>>>>>>> allow Flink to consume so much of a Foundation shared resource." >>>>>>>>> >>>>>>>>> So yes, we either >>>>>>>>> a) have to heavily reduce our CI usage or >>>>>>>>> b) fund our own, either maintaining it ourselves or donating >>>>>> to Apache. >>>>>>>>> >>>>>>>>> On 02/07/2019 05:11, Bowen Li wrote: >>>>>>>>>> By looking at the git history of the Jenkins script, its core >>>>>> part >>>>>>>>>> was finished in March 2017 (and only two minor update in >>>>>> 2017/2018), >>>>>>>>>> so it's been running for over two years now and feels like >>>>>> Zepplin >>>>>>>>>> community has been quite happy with it. @Jeff Zhang >>>>>>>>>> <mailto:[hidden email] <mailto:[hidden email]>> can you >>>>>> share your insights and user >>>>>>>>>> experience with the Jenkins+Travis approach? >>>>>>>>>> >>>>>>>>>> Things like: >>>>>>>>>> >>>>>>>>>> - has the approach completely solved the resource capacity >>>>>> problem >>>>>>>>>> for Zepplin community? is Zepplin community happy with the >>>>>> result? >>>>>>>>>> - is the whole configuration chain stable (e.g. uptime) enough? >>>>>>>>>> - how often do you need to maintain the Jenkins infra? how many >>>>>>>>>> people are usually involved in maintenance and bug-fixes? >>>>>>>>>> >>>>>>>>>> The downside of this approach seems mostly to be on the >>>>>> maintenance >>>>>>>>>> to me - maintain the script and Jenkins infra. 
>>>>>>>>>> >>>>>>>>>> ** Having Our Own Travis-CI.com Account ** >>>>>>>>>> >>>>>>>>>> Another alternative I've been thinking of is to have our own >>>>>>>>>> travis-ci.com <http://travis-ci.com> <http://travis-ci.com> >>>>>> account with paid dedicated >>>>>>>>>> resources. Note travis-ci.org <http://travis-ci.org> >>>>>> <http://travis-ci.org> is the free >>>>>>>>>> version and travis-ci.com <http://travis-ci.com> >>>>>> <http://travis-ci.com> is the commercial >>>>>>>>>> version. We currently use a shared resource pool managed by >>>>>> ASK INFRA >>>>>>>>>> team on travis-ci.org <http://travis-ci.org> >>>>>> <http://travis-ci.org>, but we have no control >>>>>>>>>> over it - we can't see how it's configured, how much >>>>>> resources are >>>>>>>>>> available, how resources are allocated among Apache projects, >>>>>> etc. >>>>>>>>>> The nice thing about having an account on travis-ci.com >>>>>> <http://travis-ci.com> >>>>>>>>>> <http://travis-ci.com> are: >>>>>>>>>> >>>>>>>>>> - relatively low cost with much better resource guarantee >>>>>> than what >>>>>>>>>> we currently have [1]: $249/month with 5 dedicated concurrency, >>>>>>>>>> $489/month with 10 concurrency >>>>>>>>>> - low maintenance work compared to using Jenkins >>>>>>>>>> - (potentially) no migration cost according to Travis's doc [2] >>>>>>>>>> (pending verification) >>>>>>>>>> - full control over the build capacity/configuration compared to >>>>>>>>>> using ASF INFRA's pool >>>>>>>>>> >>>>>>>>>> I'd be surprised if we as such a vibrant community cannot >>>>>> find and >>>>>>>>>> fund $249*12=$2988 a year in exchange for a much better >>>>> developer >>>>>>>>>> experience and much higher productivity. 
>>>>>>>>>> >>>>>>>>>> [1] https://travis-ci.com/plans >>>>>>>>>> [2] >>>>>>>>>> >>>>>>>> >>>>>> >>>>> >>> https://docs.travis-ci.com/user/migrate/open-source-repository-migration >>>>>>>>>> On Sat, Jun 29, 2019 at 8:39 AM Chesnay Schepler >>>>>> <[hidden email] <mailto:[hidden email]> >>>>>>>>>> <mailto:[hidden email] <mailto:[hidden email]>>> wrote: >>>>>>>>>> >>>>>>>>>> So yes, the Jenkins job keeps pulling the state from >>>>>> Travis until it >>>>>>>>>> finishes. >>>>>>>>>> >>>>>>>>>> Note sure I'm comfortable with the idea of using Jenkins >>>>>> workers >>>>>>>>>> just to >>>>>>>>>> idle for a several hours. >>>>>>>>>> >>>>>>>>>> On 29/06/2019 14:56, Jeff Zhang wrote: >>>>>>>>>>> Here's what zeppelin community did, we make a python >>>>>> script to >>>>>>>>>> check the >>>>>>>>>>> build status of pull request. >>>>>>>>>>> Here's script: >>>>>>>>>>> >>>>>> https://github.com/apache/zeppelin/blob/master/travis_check.py >>>>>>>>>>> >>>>>>>>>>> And this is the script we used in Jenkins build job. 
>>>>>>>>>>> >>>>>>>>>>> if [ -f "travis_check.py" ]; then >>>>>>>>>>> git log -n 1 >>>>>>>>>>> STATUS=$(curl -s $BUILD_URL | grep -e "GitHub pull >>>>>>>>>> request.*from.*" | sed >>>>>>>>>>> 's/.*GitHub pull request <a >>>>>>>>>>> href=\"\(https[^"]*\).*from[^"]*.\(https[^"]*\).*/\1 >>>>>> \2/g') >>>>>>>>>>> AUTHOR=$(echo $STATUS | sed 's/.*[/]\(.*\)$/\1/g') >>>>>>>>>>> PR=$(echo $STATUS | awk '{print $1}' | sed >>>>>>>>>> 's/.*[/]\(.*\)$/\1/g') >>>>>>>>>>> #COMMIT=$(git log -n 1 | grep "^Merge:" | awk >>>>>> '{print $3}') >>>>>>>>>>> #if [ -z $COMMIT ]; then >>>>>>>>>>> # COMMIT=$(curl -s >>>>>>>>>> https://api.github.com/repos/apache/zeppelin/pulls/$PR >>>>>>>>>>> | grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | >>>>>> tr '\n' ' ' >>>>>>>>>> | sed >>>>>>>>>>> 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' | >>>>>> grep -v >>>>>>>>>> "apache:" | >>>>>>>>>>> sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g') >>>>>>>>>>> #fi >>>>>>>>>>> >>>>>>>>>>> # get commit hash from PR >>>>>>>>>>> COMMIT=$(curl -s >>>>>>>>>> https://api.github.com/repos/apache/zeppelin/pulls/$PR | >>>>>>>>>>> grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | tr >>>>>> '\n' ' ' >>>>>>>>>> | sed >>>>>>>>>>> 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' | >>>>>> grep -v >>>>>>>>>> "apache:" | >>>>>>>>>>> sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g') >>>>>>>>>>> sleep 30 # sleep few moment to wait travis starts >>>>>> the build >>>>>>>>>>> RET_CODE=0 >>>>>>>>>>> python ./travis_check.py ${AUTHOR} ${COMMIT} || >>>>>> RET_CODE=$? >>>>>>>>>>> if [ $RET_CODE -eq 2 ]; then # try with repository >>>>>> name when >>>>>>>>>> travis-ci is >>>>>>>>>>> not available in the account >>>>>>>>>>> RET_CODE=0 >>>>>>>>>>> AUTHOR=$(curl -s >>>>>>>>>> https://api.github.com/repos/apache/zeppelin/pulls/$PR >>>>>>>>>>> | grep '"full_name":' | grep -v "apache/zeppelin" | sed >>>>>>>>>>> 's/.*[:][^"]*["]\([^/]*\).*/\1/g') >>>>>>>>>>> python ./travis_check.py ${AUTHOR} ${COMMIT} || >>>>>> RET_CODE=$? 
>>>>>>>>>>> fi >>>>>>>>>>> >>>>>>>>>>> if [ $RET_CODE -eq 2 ]; then # fail with can't find >>>>>> build >>>>>>>>>> information in >>>>>>>>>>> the travis >>>>>>>>>>> set +x >>>>>>>>>>> echo >>>>>> "-----------------------------------------------------" >>>>>>>>>>> echo "Looks like travis-ci is not configured for >>>>>> your fork." >>>>>>>>>>> echo "Please setup by swich on 'zeppelin' >>>>>> repository at >>>>>>>>>>> https://travis-ci.org/profile and travis-ci." >>>>>>>>>>> echo "And then make sure 'Build branch updates' >>>>>> option is >>>>>>>>>> enabled in >>>>>>>>>>> the settings >>>>>> https://travis-ci.org/${AUTHOR}/zeppelin/settings >>>>>> <https://travis-ci.org/$%7BAUTHOR%7D/zeppelin/settings> >>>>>>>>>> <https://travis-ci.org/$%7BAUTHOR%7D/zeppelin/settings>." >>>>>>>>>>> echo "" >>>>>>>>>>> echo "To trigger CI after setup, you will need >>>>>> ammend your >>>>>>>>>> last commit >>>>>>>>>>> with" >>>>>>>>>>> echo "git commit --amend" >>>>>>>>>>> echo "git push your-remote HEAD --force" >>>>>>>>>>> echo "" >>>>>>>>>>> echo "See >>>>>>>>>>> >>>>>>>>>> >>>>>>>> >>>>>> >>>>> >>> http://zeppelin.apache.org/contribution/contributions.html#continuous-integration >>>>>>>>>>> ." >>>>>>>>>>> fi >>>>>>>>>>> >>>>>>>>>>> exit $RET_CODE >>>>>>>>>>> else >>>>>>>>>>> set +x >>>>>>>>>>> echo "travis_check.py does not exists" >>>>>>>>>>> exit 1 >>>>>>>>>>> fi >>>>>>>>>>> >>>>>>>>>>> Chesnay Schepler <[hidden email] >>>>>> <mailto:[hidden email]> >>>>>>>>>> <mailto:[hidden email] <mailto:[hidden email]>>> >>>>>> 于2019年6月29日周六 下午3:17写道: >>>>>>>>>>> >>>>>>>>>>>> Does this imply that a Jenkins job is active as long >>>>>> as the >>>>>>>>>> Travis build >>>>>>>>>>>> runs? 
>>>>>>>>>>>> >>>>>>>>>>>> On 26/06/2019 21:28, Bowen Li wrote: >>>>>>>>>>>>> Hi, >>>>>>>>>>>>> >>>>>>>>>>>>> @Dawid, I think the "long test running" as I >>>>>> mentioned in the >>>>>>>>>> first >>>>>>>>>>>> email, >>>>>>>>>>>>> also as you guys said, belongs to "a big effort >>>>>> which is much >>>>>>>>>> harder to >>>>>>>>>>>>> accomplish in a short period of time and may deserve >>>>>> its own >>>>>>>>>> separate >>>>>>>>>>>>> discussion". Thus I didn't include it in what we can >>>>>> do in a >>>>>>>>>> foreseeable >>>>>>>>>>>>> short term. >>>>>>>>>>>>> >>>>>>>>>>>>> Besides, I don't think that's the ultimate reason >>>>>> for lack of >>>>>>>>>> build >>>>>>>>>>>>> resources. Even if the build is shortened to >>>>>> something like >>>>>>>>>> 2h, the >>>>>>>>>>>>> problems of no build machine works about 6 or more >>>>>> hours in >>>>>>>>>> PST daytime >>>>>>>>>>>>> that I described will still happen, because no >>>>>> machine from >>>>>>>>>> ASF INFRA's >>>>>>>>>>>>> pool is allocated to Flink. As I have paid close >>>>>> attention to >>>>>>>>>> the build >>>>>>>>>>>>> queue in the past few weekdays, it's a pretty clear >>>>>> pattern now. >>>>>>>>>>>>> >>>>>>>>>>>>> **The ultimate root cause** for that is - we don't >>>>>> have any >>>>>>>>>> **dedicated** >>>>>>>>>>>>> build resources that we can stably rely on. I'm >>>>>> actually ok to >>>>>>>>>> wait for a >>>>>>>>>>>>> long time if there are build requests running, it >>>>>> means at >>>>>>>>>> least we are >>>>>>>>>>>>> making progress. But I'm not ok with no build >>>>>> resource. A >>>>>>>>>> better place I >>>>>>>>>>>>> think we should aim at in short term is to always >>>>>> have at >>>>>>>>>> least a central >>>>>>>>>>>>> pool (can be 3 or 5) of machines dedicated to build >>>>>> Flink at >>>>>>>>>> any time, or >>>>>>>>>>>>> maybe use users resources. 
>>>>>>>>>>>>> >>>>>>>>>>>>> @Chesnay @Robert I synced with Jeff offline that >>>>>> Zeppelin >>>>>>>>>> community is >>>>>>>>>>>>> using a Jenkins job to automatically build on users' >>>>>> travis >>>>>>>>>> account and >>>>>>>>>>>>> link the result back to github PR. I guess the >>>>>> Jenkins job >>>>>>>>>> would fetch >>>>>>>>>>>>> latest upstream master and build the PR against it. >>>>>> Jeff has >>>>>>>>>> filed >>>>>>>>>>>> tickets >>>>>>>>>>>>> to learn and get access to the Jenkins infra. It'll >>>>>> better to >>>>>>>>>> fully >>>>>>>>>>>>> understand it first before judging this approach. >>>>>>>>>>>>> >>>>>>>>>>>>> I also heard good things about CircleCI, and ASF >>>>>> INFRA seems >>>>>>>>>> to have a >>>>>>>>>>>> pool >>>>>>>>>>>>> of build capacity there too. Can be an alternative >>>>>> to consider. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Wed, Jun 26, 2019 at 12:44 AM Dawid Wysakowicz < >>>>>>>>>>>> [hidden email] >>>>>> <mailto:[hidden email]> <mailto:[hidden email] >>>>>> <mailto:[hidden email]>>> >>>>>>>>>>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Sorry to jump in late, but I think Bowen missed the >>>>>> most >>>>>>>>>> important point >>>>>>>>>>>>>> from Chesnay's previous message in the summary. The >>>>>> ultimate >>>>>>>>>> reason for >>>>>>>>>>>>>> all the problems is that the tests take close to 2 >>>>>> hours to >>>>>>>>>> run already. >>>>>>>>>>>>>> I fully support this claim: "Unless people start >>>>>> caring about >>>>>>>>>> test times >>>>>>>>>>>>>> before adding them, this issue cannot be solved" >>>>>>>>>>>>>> >>>>>>>>>>>>>> This is also another reason why using user's Travis >>>>>> account >>>>>>>>>> won't help. >>>>>>>>>>>>>> Every few weeks we reach the user's time limit for >>>>>> a single >>>>>>>>>> profile. 
>>>>>>>>>>>>>> This makes the user's builds simply fail, until we >>>>>> either >>>>>>>>>> properly >>>>>>>>>>>>>> decrease the time the tests take (which I am not >>>>>> sure we ever >>>>>>>>>> did) or >>>>>>>>>>>>>> postpone the problem by splitting into more >>>>>> profiles. (Note >>>>>>>>>> that the ASF >>>>>>>>>>>>>> Travis account has higher time limits) >>>>>>>>>>>>>> >>>>>>>>>>>>>> Best, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Dawid >>>>>>>>>>>>>> >>>>>>>>>>>>>> On 26/06/2019 09:36, Robert Metzger wrote: >>>>>>>>>>>>>>> Do we know if using "the best" available hardware >>>>>> would >>>>>>>>>> improve the >>>>>>>>>>>> build >>>>>>>>>>>>>>> times? >>>>>>>>>>>>>>> Imagine we would run the build on machines with >>>>>> plenty of >>>>>>>>>> main memory >>>>>>>>>>>> to >>>>>>>>>>>>>>> mount everything to ramdisk + the latest CPU >>>>>> architecture? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Throwing hardware at the problem could help reduce >>>>>> the time >>>>>>>>>> of an >>>>>>>>>>>>>>> individual build, and using our own infrastructure >>>>>> would >>>>>>>>>> remove our >>>>>>>>>>>>>>> dependency on Apache's Travis account (with the >>>>>> obvious >>>>>>>>>> downside of >>>>>>>>>>>>>> having >>>>>>>>>>>>>>> to maintain the infrastructure) >>>>>>>>>>>>>>> We could use an open source travis alternative, to >>>>>> have a >>>>>>>>>> similar >>>>>>>>>>>>>>> experience and make the migration easy. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Wed, Jun 26, 2019 at 9:34 AM Chesnay Schepler >>>>>>>>>> <[hidden email] <mailto:[hidden email]> >>>>>> <mailto:[hidden email] <mailto:[hidden email]>>> >>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>>> From what I gathered, there's no special >>>>>> sauce that the >>>>>>>>>> Zeppelin >>>>>>>>>>>>>>>> project uses which actually integrates a users >>>>> Travis >>>>>>>>>> account into the >>>>>>>>>>>>>> PR. >>>>>>>>>>>>>>>> They just disabled Travis for PRs. And that's >>>>>> kind of it. 
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Naturally we can do this (duh) and safe the ASF a >>>>>> fair >>>>>>>>>> amount of >>>>>>>>>>>>>>>> resources, but there are downsides: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> The discoverability of the Travis check takes a >>>>>> nose-dive. >>>>>>>>>> Either we >>>>>>>>>>>>>>>> require every contributor to always, an every >>>>>> commit, also >>>>>>>>>> post a >>>>>>>>>>>> Travis >>>>>>>>>>>>>>>> build, or we have the reviewer sift through the >>>>>>>>>> contributors account >>>>>>>>>>>> to >>>>>>>>>>>>>>>> find it. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> This is rather cumbersome. Additionally, it's >>>>>> also not >>>>>>>>>> equivalent to >>>>>>>>>>>>>>>> having a PR build. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> A normal branch build takes a branch as is and >>>>>> tests it. A >>>>>>>>>> PR build >>>>>>>>>>>>>>>> merges the branch into master, and then runs it. >>>>>> (Fun fact: >>>>>>>>>> This is >>>>>>>>>>>> why >>>>>>>>>>>>>>>> a PR without merge conflicts is not being run on >>>>>> Travis.) >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> And ultimately, everyone can already make use of >>>>> this >>>>>>>>>> approach anyway. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On 25/06/2019 08:02, Jark Wu wrote: >>>>>>>>>>>>>>>>> Hi Jeff, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Thanks for sharing the Zeppelin approach. I >>>>>> think it's a >>>>>>>>>> good idea to >>>>>>>>>>>>>>>>> leverage user's travis account. >>>>>>>>>>>>>>>>> In this way, we can have almost unlimited >>>>>> concurrent build >>>>>>>>>> jobs and >>>>>>>>>>>>>>>>> developers can restart build by themselves >>>>>> (currently only >>>>>>>>>> committers >>>>>>>>>>>>>>>>> can restart PR's build). >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> But I'm still not very clear how to integrate >>>>> user's >>>>>>>>>> travis build >>>>>>>>>>>> into >>>>>>>>>>>>>>>>> the Flink pull request's build automatically. >>>>>> Can you >>>>>>>>>> explain more in >>>>>>>>>>>>>>>>> detail? 
>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Another question: does travis only build >>>>>> branches for user >>>>>>>>>> account? >>>>>>>>>>>>>>>>> My concern is that builds for PRs will rebase >>>>> user's >>>>>>>>>> commits against >>>>>>>>>>>>>>>>> current master branch. >>>>>>>>>>>>>>>>> This will help us to find problems before >>>>>> merge. Builds >>>>>>>>>> for branches >>>>>>>>>>>>>>>>> will lose the impact of new commits in master. >>>>>>>>>>>>>>>>> How does Zeppelin solve this problem? >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Thanks again for sharing the idea. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Regards, >>>>>>>>>>>>>>>>> Jark >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Tue, 25 Jun 2019 at 11:01, Jeff Zhang >>>>>> <[hidden email] <mailto:[hidden email]> >>>>>>>>>> <mailto:[hidden email] <mailto:[hidden email]>> >>>>>>>>>>>>>>>>> <mailto:[hidden email] >>>>>> <mailto:[hidden email]> <mailto:[hidden email] >>>>>> <mailto:[hidden email]>>>> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Hi Folks, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Zeppelin meet this kind of issue before, we solve >>>>>>>>>> it by >>>>>>>>>>>> delegating >>>>>>>>>>>>>>>>> each >>>>>>>>>>>>>>>>> one's PR build to his travis account >>>>>> (Everyone can >>>>>>>>>> have 5 free >>>>>>>>>>>>>>>>> slot for >>>>>>>>>>>>>>>>> travis build). >>>>>>>>>>>>>>>>> Apache account travis build is only triggered when >>>>>>>>>> PR is merged. 
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Kurt Young <[hidden email]> wrote on Tue, Jun 25, 2019 at 10:16 AM:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> (Forgot to cc George)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>> Kurt
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Tue, Jun 25, 2019 at 10:16 AM Kurt Young <[hidden email]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi Bowen,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks for bringing this up. We actually have discussed this, and I
>>>>>>>>>>>>>>>>>>> think Till and George have already spent some time investigating it.
>>>>>>>>>>>>>>>>>>> I have cced both of them, and maybe they can share their findings.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>> Kurt
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Tue, Jun 25, 2019 at 10:08 AM Jark Wu <[hidden email]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Bowen,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks for bringing this. We also suffered from the long build time.
>>>>>>>>>>>>>>>>>>>> I agree that we should focus on solving build capacity problem in the
>>>>>>>>>>>>>>>>>>>> thread.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> My observation is there is only one build is running, all the others
>>>>>>>>>>>>>>>>>>>> (other PRs, master) are pending.
>>>>>>>>>>>>>>>>>>>> The pricing plan[1] of travis shows it can support concurrent build
>>>>>>>>>>>>>>>>>>>> jobs. But I don't know which plan we are using, might be the free
>>>>>>>>>>>>>>>>>>>> plan for open source.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I cc-ed Chesnay who may have some experience on Travis.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>> Jark
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> [1]: https://travis-ci.com/plans
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Tue, 25 Jun 2019 at 08:11, Bowen Li <[hidden email]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi Steven,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I think you may not have read what I wrote. The discussion is about
>>>>>>>>>>>>>>>>>>>>> "unstable build **capacity**", in other words "unstable / lack of
>>>>>>>>>>>>>>>>>>>>> build resources", not "unstable build".
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Mon, Jun 24, 2019 at 4:40 PM Steven Wu <[hidden email]> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> long and sometimes unstable build is definitely a pain point.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I suspect the build failure here in flink-connector-kafka is not
>>>>>>>>>>>>>>>>>>>>>> related to my change. but there is no easy way to re-run the build
>>>>>>>>>>>>>>>>>>>>>> on travis UI. search showed a trick of close-and-open the PR will
>>>>>>>>>>>>>>>>>>>>>> trigger rebuild. but that could add noises to the PR activities.
>>>>>>>>>>>>>>>>>>>>>> https://travis-ci.org/apache/flink/jobs/545555519
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> travis-ci for my personal repo often failed with exceeding time
>>>>>>>>>>>>>>>>>>>>>> limit after 4+ hours.
>>>>>>>>>>>>>>>>>>>>>> The job exceeded the maximum time limit for jobs, and has been
>>>>>>>>>>>>>>>>>>>>>> terminated.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Mon, Jun 24, 2019 at 4:15 PM Bowen Li <[hidden email]> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> https://travis-ci.org/apache/flink/builds/549681530 This build
>>>>>>>>>>>>>>>>>>>>>>> request has been sitting at **HEAD of the queue** since I first saw
>>>>>>>>>>>>>>>>>>>>>>> it at PST 10:30am (not sure how long it's been there before
>>>>>>>>>>>>>>>>>>>>>>> 10:30am). It's PST 4:12pm now and it hasn't started yet.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Mon, Jun 24, 2019 at 2:48 PM Bowen Li <[hidden email]> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hi devs,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I've been experiencing the pain resulting from lack of stable build
>>>>>>>>>>>>>>>>>>>>>>>> capacity on Travis for Flink PRs [1].
>>>>>>>>>>>>>>>>>>>>>>>> Specifically, I noticed often that no build in the queue is making
>>>>>>>>>>>>>>>>>>>>>>>> any progress for hours, and suddenly 5 or 6 builds kick off all
>>>>>>>>>>>>>>>>>>>>>>>> together after the long pause. I'm at PST (UTC-08) time zone, and
>>>>>>>>>>>>>>>>>>>>>>>> I've seen pause can be as long as 6 hours from PST 9am to 3pm (let
>>>>>>>>>>>>>>>>>>>>>>>> alone the time needed to drain the queue afterwards).
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I think this has greatly impacted our productivity. I've experienced
>>>>>>>>>>>>>>>>>>>>>>>> that PRs submitted in the early morning of PST time zone won't
>>>>>>>>>>>>>>>>>>>>>>>> finish their build until late night of the same day.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> So my questions are:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> - Has anyone else experienced the same problem or have similar
>>>>>>>>>>>>>>>>>>>>>>>> observation on TravisCI? (I suspect it has things to do with time
>>>>>>>>>>>>>>>>>>>>>>>> zone)
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> - What pricing plan of TravisCI is Flink currently using? Is it the
>>>>>>>>>>>>>>>>>>>>>>>> free plan for open source projects?
>>>>>>>>>>>>>>>>>>>>>>>> What are the guaranteed build capacity of the current plan?
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> - If the current pricing plan (either free or paid) can't provide
>>>>>>>>>>>>>>>>>>>>>>>> stable build capacity, can we upgrade to a higher priced plan with
>>>>>>>>>>>>>>>>>>>>>>>> larger and more stable build capacity?
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> BTW, another factor that contribute to the productivity problem is
>>>>>>>>>>>>>>>>>>>>>>>> that our build is slow - we run full build for every PR and a
>>>>>>>>>>>>>>>>>>>>>>>> successful full build takes ~5h. We definitely have more options to
>>>>>>>>>>>>>>>>>>>>>>>> solve it, for instance, modularize the build graphs and reuse
>>>>>>>>>>>>>>>>>>>>>>>> artifacts from the previous build. But I think that can be a big
>>>>>>>>>>>>>>>>>>>>>>>> effort which is much harder to accomplish in a short period of time
>>>>>>>>>>>>>>>>>>>>>>>> and may deserve its own separate discussion.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> [1] https://travis-ci.org/apache/flink/pull_requests
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> Best Regards
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Jeff Zhang
+1
Dian Fu <[hidden email]> wrote on Thu, Jul 4, 2019 at 7:09 PM:

> +1. Thanks Chesnay and Bowen for pushing this forward.
>
> Regards,
> Dian
>
> > On Jul 4, 2019, at 6:28 PM, zhijiang <[hidden email]> wrote:
> >
> > +1 and thanks for Chesnay's work on this.
> >
> > Best,
> > Zhijiang
> >
> > ------------------------------------------------------------------
> > From: Haibo Sun <[hidden email]>
> > Send Time: Thu, Jul 4, 2019, 18:21
> > To: dev <[hidden email]>
> > Cc: [hidden email] <[hidden email]>
> > Subject: Re:Re: [VOTE] Migrate to sponsored Travis account
> >
> > +1. Thanks Chesnay for pushing this forward.
> >
> > Best,
> > Haibo
> >
> >
> > At 2019-07-04 17:58:28, "Kurt Young" <[hidden email]> wrote:
> >> +1 and great thanks Chesnay for pushing this.
> >>
> >> Best,
> >> Kurt
> >>
> >>
> >> On Thu, Jul 4, 2019 at 5:44 PM Aljoscha Krettek <[hidden email]> wrote:
> >>
> >>> +1
> >>>
> >>> Aljoscha
> >>>
> >>>> On 4. Jul 2019, at 11:09, Stephan Ewen <[hidden email]> wrote:
> >>>>
> >>>> +1 to move to a private Travis account.
> >>>>
> >>>> I can confirm that Ververica will sponsor a Travis CI plan that is
> >>>> equivalent or a bit higher than the previous ASF quota (10 concurrent
> >>>> build queues)
> >>>>
> >>>> Best,
> >>>> Stephan
> >>>>
> >>>> On Thu, Jul 4, 2019 at 10:46 AM Chesnay Schepler <[hidden email]> wrote:
> >>>>
> >>>>> I've raised a JIRA
> >>>>> <https://issues.apache.org/jira/browse/INFRA-18703> with INFRA to
> >>>>> inquire whether it would be possible to switch to a different Travis
> >>>>> account, and if so what steps would need to be taken.
> >>>>> We need a proper confirmation from INFRA since we are not in full
> >>>>> control of the flink repository (for example, we cannot access the
> >>>>> settings page).
> >>>>>
> >>>>> If this is indeed possible, Ververica is willing to sponsor a Travis
> >>>>> account for the Flink project.
> >>>>> This would provide us with more than enough resources.
> >>>>>
> >>>>> Since this makes the project more reliant on resources provided by
> >>>>> external companies I would like to vote on this.
> >>>>>
> >>>>> Please vote on this proposal, as follows:
> >>>>> [ ] +1, Approve the migration to a Ververica-sponsored Travis account,
> >>>>> provided that INFRA approves
> >>>>> [ ] -1, Do not approve the migration to a Ververica-sponsored Travis
> >>>>> account
> >>>>>
> >>>>> The vote will be open for at least 24h, and until we have confirmation
> >>>>> from INFRA. The voting period may be shorter than the usual 3 days since
> >>>>> our current setup is effectively not working.
> >>>>>
> >>>>> On 04/07/2019 06:51, Bowen Li wrote:
> >>>>>> Re: > Are they using their own Travis CI pool, or did they switch to
> >>>>>> an entirely different CI service?
> >>>>>>
> >>>>>> I reached out to Wes and Krisztián from Apache Arrow PMC. They are
> >>>>>> currently moving away from ASF's Travis to their own in-house metal
> >>>>>> machines at [1] with custom CI application at [2]. They've seen
> >>>>>> significant improvement w.r.t both much higher performance and
> >>>>>> basically no resource waiting time, "night-and-day" difference quoting
> >>>>>> Wes.
> >>>>>>
> >>>>>> Re: > If we can just switch to our own Travis pool, just for our
> >>>>>> project, then this might be something we can do fairly quickly?
> >>>>>>
> >>>>>> I believe so, according to [3] and [4]
> >>>>>>
> >>>>>>
> >>>>>> [1] https://ci.ursalabs.org/
> >>>>>> [2] https://github.com/ursa-labs/ursabot
> >>>>>> [3] https://docs.travis-ci.com/user/migrate/open-source-repository-migration
> >>>>>> [4] https://docs.travis-ci.com/user/migrate/open-source-on-travis-ci-com
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Wed, Jul 3, 2019 at 12:01 AM Chesnay Schepler <[hidden email]> wrote:
> >>>>>>
> >>>>>>     Are they using their own Travis CI pool, or did they switch to an
> >>>>>>     entirely different CI service?
> >>>>>>
> >>>>>>     If we can just switch to our own Travis pool, just for our
> >>>>>>     project, then this might be something we can do fairly quickly?
> >>>>>>
> >>>>>>     On 03/07/2019 05:55, Bowen Li wrote:
> >>>>>>> I responded in the INFRA ticket [1] that I believe they are using a
> >>>>>>> wrong metric against Flink and the total build time is a completely
> >>>>>>> different thing than guaranteed build capacity.
> >>>>>>>
> >>>>>>> My response:
> >>>>>>>
> >>>>>>> "As mentioned above, since I started to pay attention to Flink's build
> >>>>>>> queue a few tens of days ago, I'm in Seattle and I saw no build was
> >>>>>>> kicking off in PST daytime in weekdays for Flink. Our teammates in
> >>>>>>> China and Europe have also reported similar observations. So we need
> >>>>>>> to evaluate how the large total build time came from - if 1) your
> >>>>>>> number and 2) our observations from three locations that cover pretty
> >>>>>>> much a full day, are all true, I **guess** one reason can be that -
> >>>>>>> highly likely the extra build time came from weekends when other
> >>>>>>> Apache projects may be idle and Flink just drains hard its congested
> >>>>>>> queue.
> >>>>>>>
> >>>>>>> Please be aware of that we're not complaining about the lack of
> >>>>>>> resources in general, I'm complaining about the lack of **stable,
> >>>>>>> dedicated** resources. An example for the latter one is, currently
> >>>>>>> even if no build is in Flink's queue and I submit a request to be the
> >>>>>>> queue head in PST morning, my build won't even start in 6-8+h. That is
> >>>>>>> an absurd amount of waiting time.
> >>>>>>>
> >>>>>>> That's saying, if ASF INFRA decides to adopt a quota system and grants
> >>>>>>> Flink five DEDICATED servers that runs all the time only for Flink,
> >>>>>>> that'll be PERFECT and can totally solve our problem now.
> >>>>>>>
> >>>>>>> I feel what's missing in the ASF INFRA's Travis resource pool is some
> >>>>>>> level of build capacity SLAs and certainty"
> >>>>>>>
> >>>>>>>
> >>>>>>> Again, I believe there are differences in nature of these two
> >>>>>>> problems, long build time v.s. lack of dedicated build resource.
> >>>>>>> That's saying, shortening build time may relieve the situation, and
> >>>>>>> may not.
> >>>>>>> I'm slightly negative on disabling IT cases for PRs: the downside is
> >>>>>>> that we are at risk of potential bugs in PRs that UTs don't catch,
> >>>>>>> which may cost a lot more to fix, especially if they slow others down
> >>>>>>> or even block others. But I am open to others' opinions on it.
> >>>>>>>
> >>>>>>> AFAICT from INFRA ticket[1], donating to ASF INFRA won't be feasible
> >>>>>>> to solve our problem since INFRA's pool is fully shared and they have
> >>>>>>> no control and finer insights over resource allocation to a specific
> >>>>>>> Apache project. As mentioned in [1], Apache Arrow is moving away from
> >>>>>>> ASF INFRA Travis pool (they are actually surprised Flink hasn't
> >>>>>>> planned to do so). I know that Spark is on its own build infra. If we
> >>>>>>> all agree on funding our own build infra, I'd be glad to help
> >>>>>>> investigate any potential options after releasing 1.9 since I'm super
> >>>>>>> busy with 1.9 now.
> >>>>>>>
> >>>>>>> [1] https://issues.apache.org/jira/browse/INFRA-18533
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Jul 2, 2019 at 4:46 AM Chesnay Schepler <[hidden email]> wrote:
> >>>>>>>
> >>>>>>>> As a short-term stopgap, since we can assume this issue to become
> >>>>>>>> much worse in the following days/weeks, we could disable IT cases in
> >>>>>>>> PRs and only run them on master.
> >>>>>>>>
> >>>>>>>> On 02/07/2019 12:03, Chesnay Schepler wrote:
> >>>>>>>>> People really have to stop thinking that just because something
> >>>>>>>>> works for us it is also a good solution.
> >>>>>>>>> Also, please remember that our builds run for 2h from start to
> >>>>>>>>> finish, and not the 14 _minutes_ it takes for zeppelin.
> >>>>>>>>> We are dealing with an entirely different scale here, both in terms
> >>>>>>>>> of build times and number of builds.
> >>>>>>>>>
> >>>>>>>>> In this very thread people have been complaining about long queue
> >>>>>>>>> times for their builds. Surprise, other Apache projects have been
> >>>>>>>>> suffering the very same thing due to us not controlling our build
> >>>>>>>>> times. While switching services (be it Jenkins, CircleCI or
> >>>>>>>>> whatever) will possibly work for us (and these options are actually
> >>>>>>>>> attractive, like CircleCI's proper support for build artifacts), it
> >>>>>>>>> will also result in us likely negatively affecting other projects in
> >>>>>>>>> significant ways.
> >>>>>>>>>
> >>>>>>>>> Sure, the Jenkins setup has a good user experience for us, at the
> >>>>>>>>> cost of blocking Jenkins workers for a _lot_ of time. Right now we
> >>>>>>>>> have 25 PR's in our queue; that's possibly 50h we'd consume of
> >>>>>>>>> Jenkins resources, and the European contributors haven't even really
> >>>>>>>>> started yet.
> >>>>>>>>>
> >>>>>>>>> FYI, the latest INFRA response from INFRA-18533:
> >>>>>>>>>
> >>>>>>>>> "Our rough metrics shows that Flink used over 5800 hours of build
> >>>>>>>>> time last month. That is equal to EIGHT servers running 24/7 for the
> >>>>>>>>> ENTIRE MONTH. EIGHT. nonstop.
> >>>>>>>>> When we discovered this last night, we discussed it some and are
> >>>>>>>>> going to tune down Flink to allow only five executors maximum. We
> >>>>>>>>> cannot allow Flink to consume so much of a Foundation shared
> >>>>>>>>> resource."
> >>>>>>>>>
> >>>>>>>>> So yes, we either
> >>>>>>>>> a) have to heavily reduce our CI usage or
> >>>>>>>>> b) fund our own, either maintaining it ourselves or donating to
> >>>>>>>>> Apache.
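The two figures quoted above can be cross-checked with a few lines of arithmetic. This is only a rough sketch, not INFRA's actual accounting: it assumes a 30-day month and the roughly 2h-per-build figure mentioned in this thread, and the function names are made up for illustration.

```python
HOURS_PER_DAY = 24


def servers_equivalent(build_hours, days_in_month=30):
    """Machines that would have to run 24/7 to supply `build_hours` in one month.

    The 30-day month is an assumption; the thread does not say how INFRA counted.
    """
    return build_hours / (HOURS_PER_DAY * days_in_month)


def queue_hours(queued_prs, hours_per_build):
    """Worker hours consumed if every queued PR runs one full build."""
    return queued_prs * hours_per_build


# "over 5800 hours of build time last month" -> about eight always-on servers
print(round(servers_equivalent(5800), 2))  # 8.06

# "25 PR's in our queue" at ~2h per build -> "possibly 50h"
print(queue_hours(25, 2))  # 50
```

Under these assumptions both numbers in the message are consistent: 5800 hours is slightly more than eight machines running nonstop, and 25 queued builds at about 2h each is about 50 worker hours.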
> >>>>>>>>>
> >>>>>>>>> On 02/07/2019 05:11, Bowen Li wrote:
> >>>>>>>>>> By looking at the git history of the Jenkins script, its core part
> >>>>>>>>>> was finished in March 2017 (and only two minor updates in
> >>>>>>>>>> 2017/2018), so it's been running for over two years now and feels
> >>>>>>>>>> like Zeppelin community has been quite happy with it. @Jeff Zhang
> >>>>>>>>>> <[hidden email]> can you share your insights and user experience
> >>>>>>>>>> with the Jenkins+Travis approach?
> >>>>>>>>>>
> >>>>>>>>>> Things like:
> >>>>>>>>>>
> >>>>>>>>>> - has the approach completely solved the resource capacity problem
> >>>>>>>>>> for Zeppelin community? is Zeppelin community happy with the
> >>>>>>>>>> result?
> >>>>>>>>>> - is the whole configuration chain stable (e.g. uptime) enough?
> >>>>>>>>>> - how often do you need to maintain the Jenkins infra? how many
> >>>>>>>>>> people are usually involved in maintenance and bug-fixes?
> >>>>>>>>>>
> >>>>>>>>>> The downside of this approach seems mostly to be on the maintenance
> >>>>>>>>>> to me - maintain the script and Jenkins infra.
> >>>>>>>>>>
> >>>>>>>>>> ** Having Our Own Travis-CI.com Account **
> >>>>>>>>>>
> >>>>>>>>>> Another alternative I've been thinking of is to have our own
> >>>>>>>>>> travis-ci.com account with paid dedicated resources. Note
> >>>>>>>>>> travis-ci.org is the free version and travis-ci.com is the
> >>>>>>>>>> commercial version.
> >>>>>>>>>> We currently use a shared resource pool managed by ASF INFRA team
> >>>>>>>>>> on travis-ci.org, but we have no control over it - we can't see how
> >>>>>>>>>> it's configured, how much resources are available, how resources
> >>>>>>>>>> are allocated among Apache projects, etc.
> >>>>>>>>>> The nice things about having an account on travis-ci.com are:
> >>>>>>>>>>
> >>>>>>>>>> - relatively low cost with much better resource guarantee than what
> >>>>>>>>>> we currently have [1]: $249/month with 5 dedicated concurrency,
> >>>>>>>>>> $489/month with 10 concurrency
> >>>>>>>>>> - low maintenance work compared to using Jenkins
> >>>>>>>>>> - (potentially) no migration cost according to Travis's doc [2]
> >>>>>>>>>> (pending verification)
> >>>>>>>>>> - full control over the build capacity/configuration compared to
> >>>>>>>>>> using ASF INFRA's pool
> >>>>>>>>>>
> >>>>>>>>>> I'd be surprised if we as such a vibrant community cannot find and
> >>>>>>>>>> fund $249*12=$2988 a year in exchange for a much better developer
> >>>>>>>>>> experience and much higher productivity.
> >>>>>>>>>>
> >>>>>>>>>> [1] https://travis-ci.com/plans
> >>>>>>>>>> [2] https://docs.travis-ci.com/user/migrate/open-source-repository-migration
> >>>>>>>>>>
> >>>>>>>>>> On Sat, Jun 29, 2019 at 8:39 AM Chesnay Schepler <[hidden email]> wrote:
> >>>>>>>>>>
> >>>>>>>>>>     So yes, the Jenkins job keeps pulling the state from Travis
> >>>>>>>>>>     until it finishes.
> >>>>>>>>>>
> >>>>>>>>>>     Not sure I'm comfortable with the idea of using Jenkins workers
> >>>>>>>>>>     just to idle for several hours.
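The concern about idle workers can be illustrated with a minimal sketch of a poll-until-done loop. This is not the logic of the actual travis_check.py; `fetch_state`, the state names "passed" and "failed", and the default intervals are all hypothetical. The point is only that whoever runs the loop (here, a Jenkins worker) stays occupied for the whole wait.

```python
import time
from typing import Callable, Tuple


def wait_for_build(fetch_state: Callable[[], str],
                   poll_seconds: float = 60.0,
                   timeout_seconds: float = 4 * 3600,
                   sleep: Callable[[float], None] = time.sleep) -> Tuple[str, float]:
    """Poll `fetch_state` until it reports a terminal state or the timeout hits.

    Returns (final_state, seconds_waited). The caller is blocked for the
    entire `seconds_waited`, even though it does no build work itself.
    """
    waited = 0.0
    while True:
        state = fetch_state()
        if state in ("passed", "failed"):  # hypothetical terminal states
            return state, waited
        if waited >= timeout_seconds:
            return "timeout", waited
        sleep(poll_seconds)
        waited += poll_seconds


# Usage with a fake status source and a no-op sleep, so nothing really waits.
fake_states = iter(["running", "running", "passed"])
print(wait_for_build(lambda: next(fake_states), sleep=lambda s: None))
```

With a 2h build and one poll per minute, such a loop would hold a worker for about two hours per PR, which is the cost being questioned in the message above.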
> >>>>>>>>>>
> >>>>>>>>>>     On 29/06/2019 14:56, Jeff Zhang wrote:
> >>>>>>>>>>> Here's what zeppelin community did, we make a python script to
> >>>>>>>>>>> check the build status of pull request.
> >>>>>>>>>>> Here's script:
> >>>>>>>>>>> https://github.com/apache/zeppelin/blob/master/travis_check.py
> >>>>>>>>>>>
> >>>>>>>>>>> And this is the script we used in Jenkins build job.
> >>>>>>>>>>>
> >>>>>>>>>>> if [ -f "travis_check.py" ]; then
> >>>>>>>>>>>   git log -n 1
> >>>>>>>>>>>   STATUS=$(curl -s $BUILD_URL | grep -e "GitHub pull request.*from.*" | sed 's/.*GitHub pull request <a href=\"\(https[^"]*\).*from[^"]*.\(https[^"]*\).*/\1 \2/g')
> >>>>>>>>>>>   AUTHOR=$(echo $STATUS | sed 's/.*[/]\(.*\)$/\1/g')
> >>>>>>>>>>>   PR=$(echo $STATUS | awk '{print $1}' | sed 's/.*[/]\(.*\)$/\1/g')
> >>>>>>>>>>>   #COMMIT=$(git log -n 1 | grep "^Merge:" | awk '{print $3}')
> >>>>>>>>>>>   #if [ -z $COMMIT ]; then
> >>>>>>>>>>>   #  COMMIT=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR | grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | tr '\n' ' ' | sed 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' | grep -v "apache:" | sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g')
> >>>>>>>>>>>   #fi
> >>>>>>>>>>>
> >>>>>>>>>>>   # get commit hash from PR
> >>>>>>>>>>>   COMMIT=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR | grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | tr '\n' ' ' | sed 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' | grep -v "apache:" | sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g')
> >>>>>>>>>>>   sleep 30 # sleep few moment to wait travis starts the build
> >>>>>>>>>>>   RET_CODE=0
> >>>>>>>>>>>   python ./travis_check.py ${AUTHOR} ${COMMIT} || RET_CODE=$?
> >>>>>>>>>>>   if [ $RET_CODE -eq 2 ]; then # try with repository name when travis-ci is not available in the account
> >>>>>>>>>>>     RET_CODE=0
> >>>>>>>>>>>     AUTHOR=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR | grep '"full_name":' | grep -v "apache/zeppelin" | sed 's/.*[:][^"]*["]\([^/]*\).*/\1/g')
> >>>>>>>>>>>     python ./travis_check.py ${AUTHOR} ${COMMIT} || RET_CODE=$?
> >>>>>>>>>>>   fi
> >>>>>>>>>>>
> >>>>>>>>>>>   if [ $RET_CODE -eq 2 ]; then # fail with can't find build information in the travis
> >>>>>>>>>>>     set +x
> >>>>>>>>>>>     echo "-----------------------------------------------------"
> >>>>>>>>>>>     echo "Looks like travis-ci is not configured for your fork."
> >>>>>>>>>>>     echo "Please setup by swich on 'zeppelin' repository at https://travis-ci.org/profile and travis-ci."
> >>>>>>>>>>>     echo "And then make sure 'Build branch updates' option is enabled in the settings https://travis-ci.org/${AUTHOR}/zeppelin/settings."
> >>>>>>>>>>>     echo ""
> >>>>>>>>>>>     echo "To trigger CI after setup, you will need ammend your last commit with"
> >>>>>>>>>>>     echo "git commit --amend"
> >>>>>>>>>>>     echo "git push your-remote HEAD --force"
> >>>>>>>>>>>     echo ""
> >>>>>>>>>>>     echo "See http://zeppelin.apache.org/contribution/contributions.html#continuous-integration."
> >>>>>>>>>>>   fi
> >>>>>>>>>>>
> >>>>>>>>>>>   exit $RET_CODE
> >>>>>>>>>>> else
> >>>>>>>>>>>   set +x
> >>>>>>>>>>>   echo "travis_check.py does not exists"
> >>>>>>>>>>>   exit 1
> >>>>>>>>>>> fi
> >>>>>>>>>>>
> >>>>>>>>>>> Chesnay Schepler <[hidden email]> wrote on Sat, Jun 29, 2019 at 3:17 PM:
> >>>>>>>>>>>
> >>>>>>>>>>>> Does this imply that a Jenkins job is active as long as the
> >>>>>>>>>>>> Travis build runs?
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 26/06/2019 21:28, Bowen Li wrote:
> >>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> @Dawid, I think the "long test running" as I mentioned in the
> >>>>>>>>>>>>> first email, also as you guys said, belongs to "a big effort
> >>>>>>>>>>>>> which is much harder to accomplish in a short period of time and
> >>>>>>>>>>>>> may deserve its own separate discussion". Thus I didn't include
> >>>>>>>>>>>>> it in what we can do in a foreseeable short term.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Besides, I don't think that's the ultimate reason for lack of
> >>>>>>>>>>>>> build resources. Even if the build is shortened to something
> >>>>>>>>>>>>> like 2h, the problem I described, that no build machine works
> >>>>>>>>>>>>> for about 6 or more hours in PST daytime, will still happen,
> >>>>>>>>>>>>> because no machine from ASF INFRA's pool is allocated to Flink.
> >>>>>>>>>>>>> As I have paid close attention to the build queue in the past
> >>>>>>>>>>>>> few weekdays, it's a pretty clear pattern now.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> **The ultimate root cause** for that is - we don't have any
> >>>>>>>>>>>>> **dedicated** build resources that we can stably rely on.
> >>>>>>>>>>>>> I'm actually ok to wait for a long time if there are build
> >>>>>>>>>>>>> requests running, it means at least we are making progress. But
> >>>>>>>>>>>>> I'm not ok with no build resource. A better place I think we
> >>>>>>>>>>>>> should aim at in short term is to always have at least a central
> >>>>>>>>>>>>> pool (can be 3 or 5) of machines dedicated to build Flink at any
> >>>>>>>>>>>>> time, or maybe use users' resources.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> @Chesnay @Robert I synced with Jeff offline that Zeppelin
> >>>>>>>>>>>>> community is using a Jenkins job to automatically build on
> >>>>>>>>>>>>> users' travis account and link the result back to github PR. I
> >>>>>>>>>>>>> guess the Jenkins job would fetch latest upstream master and
> >>>>>>>>>>>>> build the PR against it. Jeff has filed tickets to learn and get
> >>>>>>>>>>>>> access to the Jenkins infra. It'll be better to fully understand
> >>>>>>>>>>>>> it first before judging this approach.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I also heard good things about CircleCI, and ASF INFRA seems to
> >>>>>>>>>>>>> have a pool of build capacity there too. Can be an alternative
> >>>>>>>>>>>>> to consider.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Wed, Jun 26, 2019 at 12:44 AM Dawid Wysakowicz <[hidden email]> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Sorry to jump in late, but I think Bowen missed the most
> >>>>>>>>>>>>>> important point from Chesnay's previous message in the summary.
> >>>>>>>>>>>>>> The ultimate reason for all the problems is that the tests take
> >>>>>>>>>>>>>> close to 2 hours to run already.
> >>>>>>>>>>>>>> I fully support this claim: "Unless people start caring about
> >>>>>>>>>>>>>> test times before adding them, this issue cannot be solved"
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> This is also another reason why using user's Travis account
> >>>>>>>>>>>>>> won't help. Every few weeks we reach the user's time limit for
> >>>>>>>>>>>>>> a single profile. This makes the user's builds simply fail,
> >>>>>>>>>>>>>> until we either properly decrease the time the tests take
> >>>>>>>>>>>>>> (which I am not sure we ever did) or postpone the problem by
> >>>>>>>>>>>>>> splitting into more profiles. (Note that the ASF Travis account
> >>>>>>>>>>>>>> has higher time limits)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Dawid
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 26/06/2019 09:36, Robert Metzger wrote:
> >>>>>>>>>>>>>>> Do we know if using "the best" available hardware would
> >>>>>>>>>>>>>>> improve the build times?
> >>>>>>>>>>>>>>> Imagine we would run the build on machines with plenty of main
> >>>>>>>>>>>>>>> memory to mount everything to ramdisk + the latest CPU
> >>>>>>>>>>>>>>> architecture?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Throwing hardware at the problem could help reduce the time of
> >>>>>>>>>>>>>>> an individual build, and using our own infrastructure would
> >>>>>>>>>>>>>>> remove our dependency on Apache's Travis account (with the
> >>>>>>>>>>>>>>> obvious downside of having to maintain the infrastructure)
> >>>>>>>>>>>>>>> We could use an open source travis alternative, to have a
> >>>>>>>>>>>>>>> similar experience and make the migration easy.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Wed, Jun 26, 2019 at 9:34 AM Chesnay Schepler <[hidden email]> wrote:
> >>>>>>>>>>>>>>>> From what I gathered, there's no special sauce that the
> >>>>>>>>>>>>>>>> Zeppelin project uses which actually integrates a user's
> >>>>>>>>>>>>>>>> Travis account into the PR. They just disabled Travis for
> >>>>>>>>>>>>>>>> PRs. And that's kind of it.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Naturally we can do this (duh) and save the ASF a fair amount
> >>>>>>>>>>>>>>>> of resources, but there are downsides:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> The discoverability of the Travis check takes a nose-dive.
> >>>>>>>>>>>>>>>> Either we require every contributor to always, on every
> >>>>>>>>>>>>>>>> commit, also post a Travis build, or we have the reviewer
> >>>>>>>>>>>>>>>> sift through the contributor's account to find it.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> This is rather cumbersome. Additionally, it's also not
> >>>>>>>>>>>>>>>> equivalent to having a PR build.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> A normal branch build takes a branch as is and tests it.
A > >>>>>>>>>> PR build > >>>>>>>>>>>>>>>> merges the branch into master, and then runs it. > >>>>>> (Fun fact: > >>>>>>>>>> This is > >>>>>>>>>>>> why > >>>>>>>>>>>>>>>> a PR without merge conflicts is not being run on > >>>>>> Travis.) > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> And ultimately, everyone can already make use of > >>>>> this > >>>>>>>>>> approach anyway. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> On 25/06/2019 08:02, Jark Wu wrote: > >>>>>>>>>>>>>>>>> Hi Jeff, > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Thanks for sharing the Zeppelin approach. I > >>>>>> think it's a > >>>>>>>>>> good idea to > >>>>>>>>>>>>>>>>> leverage user's travis account. > >>>>>>>>>>>>>>>>> In this way, we can have almost unlimited > >>>>>> concurrent build > >>>>>>>>>> jobs and > >>>>>>>>>>>>>>>>> developers can restart build by themselves > >>>>>> (currently only > >>>>>>>>>> committers > >>>>>>>>>>>>>>>>> can restart PR's build). > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> But I'm still not very clear how to integrate > >>>>> user's > >>>>>>>>>> travis build > >>>>>>>>>>>> into > >>>>>>>>>>>>>>>>> the Flink pull request's build automatically. > >>>>>> Can you > >>>>>>>>>> explain more in > >>>>>>>>>>>>>>>>> detail? > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Another question: does travis only build > >>>>>> branches for user > >>>>>>>>>> account? > >>>>>>>>>>>>>>>>> My concern is that builds for PRs will rebase > >>>>> user's > >>>>>>>>>> commits against > >>>>>>>>>>>>>>>>> current master branch. > >>>>>>>>>>>>>>>>> This will help us to find problems before > >>>>>> merge. Builds > >>>>>>>>>> for branches > >>>>>>>>>>>>>>>>> will lose the impact of new commits in master. > >>>>>>>>>>>>>>>>> How does Zeppelin solve this problem? > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Thanks again for sharing the idea. 
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>> Jark
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On Tue, 25 Jun 2019 at 11:01, Jeff Zhang <[hidden email]> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>     Hi Folks,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>     Zeppelin met this kind of issue before; we solved it by delegating
> >>>>>>>>>>>>>>>>>     each contributor's PR build to their own Travis account (everyone
> >>>>>>>>>>>>>>>>>     can have 5 free slots for Travis builds).
> >>>>>>>>>>>>>>>>>     The Apache account's Travis build is only triggered when a PR is
> >>>>>>>>>>>>>>>>>     merged.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>     Kurt Young <[hidden email]> wrote on Tue, Jun 25, 2019 at 10:16 AM:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> (Forgot to cc George)
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>> Kurt
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> On Tue, Jun 25, 2019 at 10:16 AM Kurt Young <[hidden email]> wrote:
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Hi Bowen,
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Thanks for bringing this up. We have actually discussed this, and I
> >>>>>>>>>>>>>>>>>>> think Till and George have already spent some time investigating it.
> >>>>>>>>>>>>>>>>>>> I have cced both of them, and maybe they can share their findings.
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>>>>>> Kurt
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> On Tue, Jun 25, 2019 at 10:08 AM Jark Wu <[hidden email]> wrote:
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Hi Bowen,
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Thanks for bringing this up. We also suffered from the long build
> >>>>>>>>>>>>>>>>>>>> time. I agree that we should focus on solving the build capacity
> >>>>>>>>>>>>>>>>>>>> problem in this thread.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> My observation is that only one build is running; all the others
> >>>>>>>>>>>>>>>>>>>> (other PRs, master) are pending.
> >>>>>>>>>>>>>>>>>>>> The pricing plan [1] of Travis shows it can support concurrent
> >>>>>>>>>>>>>>>>>>>> build jobs. But I don't know which plan we are using; it might be
> >>>>>>>>>>>>>>>>>>>> the free plan for open source.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> I cc-ed Chesnay who may have some experience with Travis.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>>>>>>>> Jark
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> [1]: https://travis-ci.com/plans
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> On Tue, 25 Jun 2019 at 08:11, Bowen Li <[hidden email]> wrote:
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> Hi Steven,
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> I think you may not have read what I wrote. The discussion is about
> >>>>>>>>>>>>>>>>>>>>> "unstable build **capacity**", in other words "unstable / lack of
> >>>>>>>>>>>>>>>>>>>>> build resources", not "unstable build".
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>> On Mon, Jun 24, 2019 at 4:40 PM Steven Wu <[hidden email]> wrote:
> >>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> A long and sometimes unstable build is definitely a pain point.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> I suspect the build failure here in flink-connector-kafka is not
> >>>>>>>>>>>>>>>>>>>>>> related to my change, but there is no easy way to re-run the build
> >>>>>>>>>>>>>>>>>>>>>> in the Travis UI. A search showed a trick: closing and reopening
> >>>>>>>>>>>>>>>>>>>>>> the PR will trigger a rebuild, but that could add noise to the PR
> >>>>>>>>>>>>>>>>>>>>>> activity.
> >>>>>>>>>>>>>>>>>>>>>> https://travis-ci.org/apache/flink/jobs/545555519
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> travis-ci for my personal repo often failed by exceeding the time
> >>>>>>>>>>>>>>>>>>>>>> limit after 4+ hours:
> >>>>>>>>>>>>>>>>>>>>>> The job exceeded the maximum time limit for jobs, and has been
> >>>>>>>>>>>>>>>>>>>>>> terminated.
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>> On Mon, Jun 24, 2019 at 4:15 PM Bowen Li <[hidden email]> wrote:
> >>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> https://travis-ci.org/apache/flink/builds/549681530 This build
> >>>>>>>>>>>>>>>>>>>>>>> request has been sitting at **HEAD of the queue** since I first
> >>>>>>>>>>>>>>>>>>>>>>> saw it at PST 10:30am (not sure how long it's been there before
> >>>>>>>>>>>>>>>>>>>>>>> 10:30am). It's PST 4:12pm now and it hasn't started yet.
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>> On Mon, Jun 24, 2019 at 2:48 PM Bowen Li <[hidden email]> wrote:
> >>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> Hi devs,
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> I've been experiencing the pain resulting from the lack of stable
> >>>>>>>>>>>>>>>>>>>>>>>> build capacity on Travis for Flink PRs [1]. Specifically, I often
> >>>>>>>>>>>>>>>>>>>>>>>> noticed that no build in the queue makes any progress for hours,
> >>>>>>>>>>>>>>>>>>>>>>>> and then suddenly 5 or 6 builds kick off all together after the
> >>>>>>>>>>>>>>>>>>>>>>>> long pause. I'm in the PST (UTC-08) time zone, and I've seen the
> >>>>>>>>>>>>>>>>>>>>>>>> pause last as long as 6 hours, from PST 9am to 3pm (let alone the
> >>>>>>>>>>>>>>>>>>>>>>>> time needed to drain the queue afterwards).
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> I think this has greatly impacted our productivity. I've
> >>>>>>>>>>>>>>>>>>>>>>>> experienced that PRs submitted in the early morning PST won't
> >>>>>>>>>>>>>>>>>>>>>>>> finish their build until late at night on the same day.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> So my questions are:
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> - Has anyone else experienced the same problem or made similar
> >>>>>>>>>>>>>>>>>>>>>>>>   observations on TravisCI? (I suspect it has something to do
> >>>>>>>>>>>>>>>>>>>>>>>>   with the time zone)
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> - What pricing plan of TravisCI is Flink currently using? Is it
> >>>>>>>>>>>>>>>>>>>>>>>>   the free plan for open source projects? What is the guaranteed
> >>>>>>>>>>>>>>>>>>>>>>>>   build capacity of the current plan?
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> - If the current pricing plan (either free or paid) can't provide
> >>>>>>>>>>>>>>>>>>>>>>>>   stable build capacity, can we upgrade to a higher priced plan
> >>>>>>>>>>>>>>>>>>>>>>>>   with larger and more stable build capacity?
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> BTW, another factor that contributes to the productivity problem
> >>>>>>>>>>>>>>>>>>>>>>>> is that our build is slow - we run a full build for every PR and
> >>>>>>>>>>>>>>>>>>>>>>>> a successful full build takes ~5h. We definitely have more
> >>>>>>>>>>>>>>>>>>>>>>>> options to solve it, for instance, modularize the build graphs
> >>>>>>>>>>>>>>>>>>>>>>>> and reuse artifacts from the previous build. But I think that can
> >>>>>>>>>>>>>>>>>>>>>>>> be a big effort which is much harder to accomplish in a short
> >>>>>>>>>>>>>>>>>>>>>>>> period of time and may deserve its own separate discussion.
> >>>>>>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>>>>>> [1] https://travis-ci.org/apache/flink/pull_requests
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>>>> Best Regards
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Jeff Zhang
+1
vino yang <[hidden email]> wrote on Thu, Jul 4, 2019 at 7:55 PM:

> +1
>
> Dian Fu <[hidden email]> wrote on Thu, Jul 4, 2019 at 7:09 PM:
>
> > +1. Thanks Chesnay and Bowen for pushing this forward.
> >
> > Regards,
> > Dian
> >
> > > On Jul 4, 2019, at 6:28 PM, zhijiang <[hidden email]> wrote:
> > >
> > > +1 and thanks for Chesnay's work on this.
> > >
> > > Best,
> > > Zhijiang
> > >
> > > ------------------------------------------------------------------
> > > From: Haibo Sun <[hidden email]>
> > > Send Time: Thu, Jul 4, 2019 18:21
> > > To: dev <[hidden email]>
> > > Cc: [hidden email] <[hidden email]>
> > > Subject: Re:Re: [VOTE] Migrate to sponsored Travis account
> > >
> > > +1. Thanks to Chesnay for pushing this forward.
> > >
> > > Best,
> > > Haibo
> > >
> > > At 2019-07-04 17:58:28, "Kurt Young" <[hidden email]> wrote:
> > >> +1 and great thanks Chesnay for pushing this.
> > >>
> > >> Best,
> > >> Kurt
> > >>
> > >> On Thu, Jul 4, 2019 at 5:44 PM Aljoscha Krettek <[hidden email]> wrote:
> > >>
> > >>> +1
> > >>>
> > >>> Aljoscha
> > >>>
> > >>>> On 4. Jul 2019, at 11:09, Stephan Ewen <[hidden email]> wrote:
> > >>>>
> > >>>> +1 to move to a private Travis account.
> > >>>>
> > >>>> I can confirm that Ververica will sponsor a Travis CI plan that is
> > >>>> equivalent or a bit higher than the previous ASF quota (10 concurrent
> > >>>> build queues)
> > >>>>
> > >>>> Best,
> > >>>> Stephan
> > >>>>
> > >>>> On Thu, Jul 4, 2019 at 10:46 AM Chesnay Schepler <[hidden email]> wrote:
> > >>>>
> > >>>>> I've raised a JIRA <https://issues.apache.org/jira/browse/INFRA-18703>
> > >>>>> with INFRA to inquire whether it would be possible to switch to a
> > >>>>> different Travis account, and if so what steps would need to be taken.
> > >>>>> We need a proper confirmation from INFRA since we are not in full
> > >>>>> control of the flink repository (for example, we cannot access the
> > >>>>> settings page).
> > >>>>>
> > >>>>> If this is indeed possible, Ververica is willing to sponsor a Travis
> > >>>>> account for the Flink project.
> > >>>>> This would provide us with more than enough resources.
> > >>>>>
> > >>>>> Since this makes the project more reliant on resources provided by
> > >>>>> external companies I would like to vote on this.
> > >>>>>
> > >>>>> Please vote on this proposal, as follows:
> > >>>>> [ ] +1, Approve the migration to a Ververica-sponsored Travis account,
> > >>>>>     provided that INFRA approves
> > >>>>> [ ] -1, Do not approve the migration to a Ververica-sponsored Travis
> > >>>>>     account
> > >>>>>
> > >>>>> The vote will be open for at least 24h, and until we have confirmation
> > >>>>> from INFRA. The voting period may be shorter than the usual 3 days since
> > >>>>> our current setup is effectively not working.
> > >>>>>
> > >>>>> On 04/07/2019 06:51, Bowen Li wrote:
> > >>>>>> Re: > Are they using their own Travis CI pool, or did they switch to
> > >>>>>> an entirely different CI service?
> > >>>>>>
> > >>>>>> I reached out to Wes and Krisztián from the Apache Arrow PMC. They are
> > >>>>>> currently moving away from ASF's Travis to their own in-house metal
> > >>>>>> machines at [1] with a custom CI application at [2]. They've seen
> > >>>>>> significant improvement w.r.t. both much higher performance and
> > >>>>>> basically no resource waiting time, a "night-and-day" difference,
> > >>>>>> quoting Wes.
> > >>>>>>
> > >>>>>> Re: > If we can just switch to our own Travis pool, just for our
> > >>>>>> project, then this might be something we can do fairly quickly?
> > >>>>>>
> > >>>>>> I believe so, according to [3] and [4]
> > >>>>>>
> > >>>>>> [1] https://ci.ursalabs.org/
> > >>>>>> [2] https://github.com/ursa-labs/ursabot
> > >>>>>> [3] https://docs.travis-ci.com/user/migrate/open-source-repository-migration
> > >>>>>> [4] https://docs.travis-ci.com/user/migrate/open-source-on-travis-ci-com
> > >>>>>>
> > >>>>>> On Wed, Jul 3, 2019 at 12:01 AM Chesnay Schepler <[hidden email]> wrote:
> > >>>>>>
> > >>>>>>     Are they using their own Travis CI pool, or did they switch to an
> > >>>>>>     entirely different CI service?
> > >>>>>>
> > >>>>>>     If we can just switch to our own Travis pool, just for our
> > >>>>>>     project, then this might be something we can do fairly quickly?
> > >>>>>>
> > >>>>>>     On 03/07/2019 05:55, Bowen Li wrote:
> > >>>>>>> I responded in the INFRA ticket [1] that I believe they are using a
> > >>>>>>> wrong metric against Flink and the total build time is a completely
> > >>>>>>> different thing than guaranteed build capacity.
> > >>>>>>>
> > >>>>>>> My response:
> > >>>>>>>
> > >>>>>>> "As mentioned above, since I started to pay attention to Flink's build
> > >>>>>>> queue a few tens of days ago, I'm in Seattle and I saw no build was
> > >>>>>>> kicking off in PST daytime on weekdays for Flink. Our teammates in
> > >>>>>>> China and Europe have also reported similar observations.
> > >>>>>>> So we need to evaluate where the large total build time came from - if
> > >>>>>>> 1) your number and 2) our observations from three locations that cover
> > >>>>>>> pretty much a full day are all true, I **guess** one reason can be
> > >>>>>>> that, highly likely, the extra build time came from weekends when
> > >>>>>>> other Apache projects may be idle and Flink just drains its congested
> > >>>>>>> queue hard.
> > >>>>>>>
> > >>>>>>> Please be aware that we're not complaining about the lack of resources
> > >>>>>>> in general, I'm complaining about the lack of **stable, dedicated**
> > >>>>>>> resources. An example for the latter is: currently, even if no build
> > >>>>>>> is in Flink's queue and I submit a request to be the queue head in the
> > >>>>>>> PST morning, my build won't even start in 6-8+h. That is an absurd
> > >>>>>>> amount of waiting time.
> > >>>>>>>
> > >>>>>>> That is to say, if ASF INFRA decides to adopt a quota system and
> > >>>>>>> grants Flink five DEDICATED servers that run all the time only for
> > >>>>>>> Flink, that'll be PERFECT and can totally solve our problem now.
> > >>>>>>>
> > >>>>>>> I feel what's missing in the ASF INFRA's Travis resource pool is some
> > >>>>>>> level of build capacity SLAs and certainty"
> > >>>>>>>
> > >>>>>>> Again, I believe these two problems differ in nature: long build time
> > >>>>>>> vs. lack of dedicated build resources. That is to say, shortening the
> > >>>>>>> build time may relieve the situation, and may not. I'm slightly
> > >>>>>>> negative on disabling IT cases for PRs, because the downside is that
> > >>>>>>> we risk bugs in PRs that UTs don't catch, which may cost a lot more to
> > >>>>>>> fix if they slow down or even block others, but I am open to others'
> > >>>>>>> opinions on it.
> > >>>>>>>
> > >>>>>>> AFAICT from INFRA ticket [1], donating to ASF INFRA won't be feasible
> > >>>>>>> to solve our problem since INFRA's pool is fully shared and they have
> > >>>>>>> no control and finer insights over resource allocation to a specific
> > >>>>>>> Apache project. As mentioned in [1], Apache Arrow is moving away from
> > >>>>>>> the ASF INFRA Travis pool (they are actually surprised Flink hasn't
> > >>>>>>> planned to do so). I know that Spark is on its own build infra. If we
> > >>>>>>> all agree on funding our own build infra, I'd be glad to help
> > >>>>>>> investigate any potential options after releasing 1.9, since I'm super
> > >>>>>>> busy with 1.9 now.
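The capacity figures traded in this thread can be put on one scale with a little arithmetic. The sketch below is only an illustration: it takes INFRA's "over 5800 hours" figure from INFRA-18533 (quoted further down the thread), the five dedicated servers asked for above, and the "25 PRs at about 2h each" estimate from Chesnay's mail. The 30-day month, the assumption that one build occupies exactly one executor, and all function names are assumptions made here, not anything INFRA or Travis defines.

```python
HOURS_PER_DAY = 24


def always_on_servers(build_hours, days_in_month=30):
    """Servers that would have to run 24/7 to supply build_hours in a month."""
    return build_hours / (HOURS_PER_DAY * days_in_month)


def monthly_capacity_hours(executors, days_in_month=30):
    """Build hours that a fixed pool of always-on executors can supply."""
    return executors * HOURS_PER_DAY * days_in_month


def queue_drain_hours(queued_builds, hours_per_build, executors):
    """Idealized wall-clock time to clear a queue.

    Assumes every build occupies exactly one executor and ignores
    scheduling overhead, so treat the result as a lower bound.
    """
    return queued_builds * hours_per_build / executors


if __name__ == "__main__":
    # INFRA's figure: 5800 build hours in one month.
    print(round(always_on_servers(5800), 2))   # 8.06, i.e. INFRA's "EIGHT servers"
    # Capacity of the five-executor cap / five dedicated servers.
    print(monthly_capacity_hours(5))           # 3600
    # 25 queued PRs at roughly 2h each on five executors.
    print(queue_drain_hours(25, 2, 5))         # 10.0
```

Under these assumptions, five always-on executors supply about 3600 build hours a month, well short of the 5800 hours INFRA measured, which is consistent with both sides of the argument: dedicated capacity removes the idle gaps, but build time still has to come down.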
> > >>>>>>>
> > >>>>>>> [1] https://issues.apache.org/jira/browse/INFRA-18533
> > >>>>>>>
> > >>>>>>> On Tue, Jul 2, 2019 at 4:46 AM Chesnay Schepler <[hidden email]> wrote:
> > >>>>>>>
> > >>>>>>>> As a short-term stopgap, since we can assume this issue to become
> > >>>>>>>> much worse in the following days/weeks, we could disable IT cases in
> > >>>>>>>> PRs and only run them on master.
> > >>>>>>>>
> > >>>>>>>> On 02/07/2019 12:03, Chesnay Schepler wrote:
> > >>>>>>>>> People really have to stop thinking that just because something
> > >>>>>>>>> works for us it is also a good solution.
> > >>>>>>>>> Also, please remember that our builds run for 2h from start to
> > >>>>>>>>> finish, and not the 14 _minutes_ it takes for zeppelin.
> > >>>>>>>>> We are dealing with an entirely different scale here, both in terms
> > >>>>>>>>> of build times and number of builds.
> > >>>>>>>>>
> > >>>>>>>>> In this very thread people have been complaining about long queue
> > >>>>>>>>> times for their builds. Surprise, other Apache projects have been
> > >>>>>>>>> suffering the very same thing due to us not controlling our build
> > >>>>>>>>> times. While switching services (be it Jenkins, CircleCI or
> > >>>>>>>>> whatever) will possibly work for us (and these options are actually
> > >>>>>>>>> attractive, like CircleCI's proper support for build artifacts), it
> > >>>>>>>>> will also result in us likely negatively affecting other projects in
> > >>>>>>>>> significant ways.
> > >>>>>>>>>
> > >>>>>>>>> Sure, the Jenkins setup has a good user experience for us, at the
> > >>>>>>>>> cost of blocking Jenkins workers for a _lot_ of time.
> > >>>>>>>>> Right now we have 25 PRs in our queue; that's possibly 50h we'd
> > >>>>>>>>> consume of Jenkins resources, and the European contributors haven't
> > >>>>>>>>> even really started yet.
> > >>>>>>>>>
> > >>>>>>>>> FYI, the latest INFRA response from INFRA-18533:
> > >>>>>>>>>
> > >>>>>>>>> "Our rough metrics shows that Flink used over 5800 hours of build
> > >>>>>>>>> time last month. That is equal to EIGHT servers running 24/7 for the
> > >>>>>>>>> ENTIRE MONTH. EIGHT. nonstop.
> > >>>>>>>>> When we discovered this last night, we discussed it some and are
> > >>>>>>>>> going to tune down Flink to allow only five executors maximum. We
> > >>>>>>>>> cannot allow Flink to consume so much of a Foundation shared
> > >>>>>>>>> resource."
> > >>>>>>>>>
> > >>>>>>>>> So yes, we either
> > >>>>>>>>> a) have to heavily reduce our CI usage or
> > >>>>>>>>> b) fund our own, either maintaining it ourselves or donating to
> > >>>>>>>>>    Apache.
> > >>>>>>>>>
> > >>>>>>>>> On 02/07/2019 05:11, Bowen Li wrote:
> > >>>>>>>>>> By looking at the git history of the Jenkins script, its core part
> > >>>>>>>>>> was finished in March 2017 (with only two minor updates in
> > >>>>>>>>>> 2017/2018), so it's been running for over two years now and it
> > >>>>>>>>>> feels like the Zeppelin community has been quite happy with it.
> > >>>>>>>>>> @Jeff Zhang <[hidden email]> can you share your insights and user
> > >>>>>>>>>> experience with the Jenkins+Travis approach?
> > >>>>>>>>>>
> > >>>>>>>>>> Things like:
> > >>>>>>>>>>
> > >>>>>>>>>> - has the approach completely solved the resource capacity problem
> > >>>>>>>>>>   for the Zeppelin community? is the Zeppelin community happy with
> > >>>>>>>>>>   the result?
> > >>>>>>>>>> - is the whole configuration chain stable (e.g. uptime) enough?
> > >>>>>>>>>> - how often do you need to maintain the Jenkins infra?
> > >>>>>>>>>>   how many people are usually involved in maintenance and bug-fixes?
> > >>>>>>>>>>
> > >>>>>>>>>> The downside of this approach seems to me to be mostly the
> > >>>>>>>>>> maintenance: maintaining the script and the Jenkins infra.
> > >>>>>>>>>>
> > >>>>>>>>>> ** Having Our Own Travis-CI.com Account **
> > >>>>>>>>>>
> > >>>>>>>>>> Another alternative I've been thinking of is to have our own
> > >>>>>>>>>> travis-ci.com account with paid dedicated resources. Note
> > >>>>>>>>>> travis-ci.org is the free version and travis-ci.com is the
> > >>>>>>>>>> commercial version. We currently use a shared resource pool managed
> > >>>>>>>>>> by the ASF INFRA team on travis-ci.org, but we have no control over
> > >>>>>>>>>> it - we can't see how it's configured, how many resources are
> > >>>>>>>>>> available, how resources are allocated among Apache projects, etc.
> > >>>>>>>>>> The nice things about having an account on travis-ci.com are:
> > >>>>>>>>>>
> > >>>>>>>>>> - relatively low cost with a much better resource guarantee than
> > >>>>>>>>>>   what we currently have [1]: $249/month with 5 dedicated concurrent
> > >>>>>>>>>>   jobs, $489/month with 10 concurrent jobs
> > >>>>>>>>>> - low maintenance work compared to using Jenkins
> > >>>>>>>>>> - (potentially) no migration cost according to Travis's doc [2]
> > >>>>>>>>>>   (pending verification)
> > >>>>>>>>>> - full control over the build capacity/configuration compared to
> > >>>>>>>>>>   using ASF INFRA's pool
> > >>>>>>>>>>
> > >>>>>>>>>> I'd be surprised if we as such a vibrant community cannot find and
> > >>>>>>>>>> fund $249*12=$2988 a year in exchange for a much better developer
> > >>>>>>>>>> experience and much higher productivity.
> > >>>>>>>>>>
> > >>>>>>>>>> [1] https://travis-ci.com/plans
> > >>>>>>>>>> [2] https://docs.travis-ci.com/user/migrate/open-source-repository-migration
> > >>>>>>>>>>
> > >>>>>>>>>> On Sat, Jun 29, 2019 at 8:39 AM Chesnay Schepler <[hidden email]> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>     So yes, the Jenkins job keeps polling the state from Travis
> > >>>>>>>>>>     until it finishes.
> > >>>>>>>>>>
> > >>>>>>>>>>     Not sure I'm comfortable with the idea of using Jenkins workers
> > >>>>>>>>>>     just to idle for several hours.
> > >>>>>>>>>>
> > >>>>>>>>>>     On 29/06/2019 14:56, Jeff Zhang wrote:
> > >>>>>>>>>>> Here's what the Zeppelin community did: we made a Python script to
> > >>>>>>>>>>> check the build status of a pull request.
> > >>>>>>>>>>> Here's the script:
> > >>>>>>>>>>> https://github.com/apache/zeppelin/blob/master/travis_check.py
> > >>>>>>>>>>>
> > >>>>>>>>>>> And this is the script we used in the Jenkins build job.
> > >>>>>>>>>>>
> > >>>>>>>>>>> if [ -f "travis_check.py" ]; then
> > >>>>>>>>>>>   git log -n 1
> > >>>>>>>>>>>   STATUS=$(curl -s $BUILD_URL |
> > >>>>>>>>>>>     grep -e "GitHub pull request.*from.*" |
> > >>>>>>>>>>>     sed 's/.*GitHub pull request <a href=\"\(https[^"]*\).*from[^"]*.\(https[^"]*\).*/\1 \2/g')
> > >>>>>>>>>>>   AUTHOR=$(echo $STATUS | sed 's/.*[/]\(.*\)$/\1/g')
> > >>>>>>>>>>>   PR=$(echo $STATUS | awk '{print $1}' | sed 's/.*[/]\(.*\)$/\1/g')
> > >>>>>>>>>>>   #COMMIT=$(git log -n 1 | grep "^Merge:" | awk '{print $3}')
> > >>>>>>>>>>>   #if [ -z $COMMIT ]; then
> > >>>>>>>>>>>   #  COMMIT=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR |
> > >>>>>>>>>>>   #    grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | tr '\n' ' ' |
> > >>>>>>>>>>>   #    sed 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' |
> > >>>>>>>>>>>   #    grep -v "apache:" | sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g')
> > >>>>>>>>>>>   #fi
> > >>>>>>>>>>>
> > >>>>>>>>>>>   # get commit hash from PR
> > >>>>>>>>>>>   COMMIT=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR |
> > >>>>>>>>>>>     grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | tr '\n' ' ' |
> > >>>>>>>>>>>     sed 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' |
> > >>>>>>>>>>>     grep -v "apache:" | sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g')
> > >>>>>>>>>>>   sleep 30 # wait a moment for travis to start the build
> > >>>>>>>>>>>   RET_CODE=0
> > >>>>>>>>>>>   python ./travis_check.py ${AUTHOR} ${COMMIT} || RET_CODE=$?
> > >>>>>>>>>>>   if [ $RET_CODE -eq 2 ]; then # try with repository name when travis-ci is not available in the account
> > >>>>>>>>>>>     RET_CODE=0
> > >>>>>>>>>>>     AUTHOR=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR |
> > >>>>>>>>>>>       grep '"full_name":' | grep -v "apache/zeppelin" |
> > >>>>>>>>>>>       sed 's/.*[:][^"]*["]\([^/]*\).*/\1/g')
> > >>>>>>>>>>>     python ./travis_check.py ${AUTHOR} ${COMMIT} || RET_CODE=$?
> > >>>>>>>>>>>   fi
> > >>>>>>>>>>>
> > >>>>>>>>>>>   if [ $RET_CODE -eq 2 ]; then # fail: can't find build information in travis
> > >>>>>>>>>>>     set +x
> > >>>>>>>>>>>     echo "-----------------------------------------------------"
> > >>>>>>>>>>>     echo "Looks like travis-ci is not configured for your fork."
> > >>>>>>>>>>>     echo "Please set it up by switching on the 'zeppelin' repository at https://travis-ci.org/profile and travis-ci."
> > >>>>>>>>>>>     echo "And then make sure 'Build branch updates' option is enabled in the settings https://travis-ci.org/${AUTHOR}/zeppelin/settings."
> > >>>>>>>>>>>     echo ""
> > >>>>>>>>>>>     echo "To trigger CI after setup, you will need to amend your last commit with"
> > >>>>>>>>>>>     echo "git commit --amend"
> > >>>>>>>>>>>     echo "git push your-remote HEAD --force"
> > >>>>>>>>>>>     echo ""
> > >>>>>>>>>>>     echo "See http://zeppelin.apache.org/contribution/contributions.html#continuous-integration."
> > >>>>>>>>>>>   fi
> > >>>>>>>>>>>
> > >>>>>>>>>>>   exit $RET_CODE
> > >>>>>>>>>>> else
> > >>>>>>>>>>>   set +x
> > >>>>>>>>>>>   echo "travis_check.py does not exist"
> > >>>>>>>>>>>   exit 1
> > >>>>>>>>>>> fi
> > >>>>>>>>>>>
> > >>>>>>>>>>> Chesnay Schepler <[hidden email]> wrote on Sat, Jun 29, 2019 at 3:17 PM:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> Does this imply that a Jenkins job is active as long as the Travis
> > >>>>>>>>>>>> build runs?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On 26/06/2019 21:28, Bowen Li wrote:
> > >>>>>>>>>>>>> Hi,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> @Dawid, I think the "long test running" as I mentioned in the
> > >>>>>>>>>>>>> first email, also as you guys said, belongs to "a big effort
> > >>>>>>>>>>>>> which is much harder to accomplish in a short period of time and
> > >>>>>>>>>>>>> may deserve its own separate discussion". Thus I didn't include
> > >>>>>>>>>>>>> it in what we can do in a foreseeable short term.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Besides, I don't think that's the ultimate reason for the lack of
> > >>>>>>>>>>>>> build resources. Even if the build is shortened to something like
> > >>>>>>>>>>>>> 2h, the problem I described, that no build machine works for 6 or
> > >>>>>>>>>>>>> more hours in PST daytime, will still happen, because no machine
> > >>>>>>>>>>>>> from ASF INFRA's pool is allocated to Flink. As I have paid close
> > >>>>>>>>>>>>> attention to the build queue in the past few weekdays, it's a
> > >>>>>>>>>>>>> pretty clear pattern now.
> > >>>>>>>>>>>>> > > >>>>>>>>>>>>> **The ultimate root cause** for that is - we don't > > >>>>>> have any > > >>>>>>>>>> **dedicated** > > >>>>>>>>>>>>> build resources that we can stably rely on. I'm > > >>>>>> actually ok to > > >>>>>>>>>> wait for a > > >>>>>>>>>>>>> long time if there are build requests running, it > > >>>>>> means at > > >>>>>>>>>> least we are > > >>>>>>>>>>>>> making progress. But I'm not ok with no build > > >>>>>> resource. A > > >>>>>>>>>> better place I > > >>>>>>>>>>>>> think we should aim at in short term is to always > > >>>>>> have at > > >>>>>>>>>> least a central > > >>>>>>>>>>>>> pool (can be 3 or 5) of machines dedicated to build > > >>>>>> Flink at > > >>>>>>>>>> any time, or > > >>>>>>>>>>>>> maybe use users resources. > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> @Chesnay @Robert I synced with Jeff offline that > > >>>>>> Zeppelin > > >>>>>>>>>> community is > > >>>>>>>>>>>>> using a Jenkins job to automatically build on users' > > >>>>>> travis > > >>>>>>>>>> account and > > >>>>>>>>>>>>> link the result back to github PR. I guess the > > >>>>>> Jenkins job > > >>>>>>>>>> would fetch > > >>>>>>>>>>>>> latest upstream master and build the PR against it. > > >>>>>> Jeff has > > >>>>>>>>>> filed > > >>>>>>>>>>>> tickets > > >>>>>>>>>>>>> to learn and get access to the Jenkins infra. It'll > > >>>>>> better to > > >>>>>>>>>> fully > > >>>>>>>>>>>>> understand it first before judging this approach. > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> I also heard good things about CircleCI, and ASF > > >>>>>> INFRA seems > > >>>>>>>>>> to have a > > >>>>>>>>>>>> pool > > >>>>>>>>>>>>> of build capacity there too. Can be an alternative > > >>>>>> to consider. 
> > >>>>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> On Wed, Jun 26, 2019 at 12:44 AM Dawid Wysakowicz < > > >>>>>>>>>>>> [hidden email] > > >>>>>> <mailto:[hidden email]> <mailto:[hidden email] > > >>>>>> <mailto:[hidden email]>>> > > >>>>>>>>>>>>> wrote: > > >>>>>>>>>>>>> > > >>>>>>>>>>>>>> Sorry to jump in late, but I think Bowen missed the > > >>>>>> most > > >>>>>>>>>> important point > > >>>>>>>>>>>>>> from Chesnay's previous message in the summary. The > > >>>>>> ultimate > > >>>>>>>>>> reason for > > >>>>>>>>>>>>>> all the problems is that the tests take close to 2 > > >>>>>> hours to > > >>>>>>>>>> run already. > > >>>>>>>>>>>>>> I fully support this claim: "Unless people start > > >>>>>> caring about > > >>>>>>>>>> test times > > >>>>>>>>>>>>>> before adding them, this issue cannot be solved" > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> This is also another reason why using user's Travis > > >>>>>> account > > >>>>>>>>>> won't help. > > >>>>>>>>>>>>>> Every few weeks we reach the user's time limit for > > >>>>>> a single > > >>>>>>>>>> profile. > > >>>>>>>>>>>>>> This makes the user's builds simply fail, until we > > >>>>>> either > > >>>>>>>>>> properly > > >>>>>>>>>>>>>> decrease the time the tests take (which I am not > > >>>>>> sure we ever > > >>>>>>>>>> did) or > > >>>>>>>>>>>>>> postpone the problem by splitting into more > > >>>>>> profiles. (Note > > >>>>>>>>>> that the ASF > > >>>>>>>>>>>>>> Travis account has higher time limits) > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> Best, > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> Dawid > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> On 26/06/2019 09:36, Robert Metzger wrote: > > >>>>>>>>>>>>>>> Do we know if using "the best" available hardware > > >>>>>> would > > >>>>>>>>>> improve the > > >>>>>>>>>>>> build > > >>>>>>>>>>>>>>> times? 
> > >>>>>>>>>>>>>>> Imagine we would run the build on machines with > > >>>>>> plenty of > > >>>>>>>>>> main memory > > >>>>>>>>>>>> to > > >>>>>>>>>>>>>>> mount everything to ramdisk + the latest CPU > > >>>>>> architecture? > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> Throwing hardware at the problem could help reduce > > >>>>>> the time > > >>>>>>>>>> of an > > >>>>>>>>>>>>>>> individual build, and using our own infrastructure > > >>>>>> would > > >>>>>>>>>> remove our > > >>>>>>>>>>>>>>> dependency on Apache's Travis account (with the > > >>>>>> obvious > > >>>>>>>>>> downside of > > >>>>>>>>>>>>>> having > > >>>>>>>>>>>>>>> to maintain the infrastructure) > > >>>>>>>>>>>>>>> We could use an open source travis alternative, to > > >>>>>> have a > > >>>>>>>>>> similar > > >>>>>>>>>>>>>>> experience and make the migration easy. > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> On Wed, Jun 26, 2019 at 9:34 AM Chesnay Schepler > > >>>>>>>>>> <[hidden email] <mailto:[hidden email]> > > >>>>>> <mailto:[hidden email] <mailto:[hidden email]>>> > > >>>>>>>>>>>>>> wrote: > > >>>>>>>>>>>>>>>>> From what I gathered, there's no special > > >>>>>> sauce that the > > >>>>>>>>>> Zeppelin > > >>>>>>>>>>>>>>>> project uses which actually integrates a users > > >>>>> Travis > > >>>>>>>>>> account into the > > >>>>>>>>>>>>>> PR. > > >>>>>>>>>>>>>>>> They just disabled Travis for PRs. And that's > > >>>>>> kind of it. > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> Naturally we can do this (duh) and safe the ASF a > > >>>>>> fair > > >>>>>>>>>> amount of > > >>>>>>>>>>>>>>>> resources, but there are downsides: > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> The discoverability of the Travis check takes a > > >>>>>> nose-dive. 
> > >>>>>>>>>> Either we > > >>>>>>>>>>>>>>>> require every contributor to always, an every > > >>>>>> commit, also > > >>>>>>>>>> post a > > >>>>>>>>>>>> Travis > > >>>>>>>>>>>>>>>> build, or we have the reviewer sift through the > > >>>>>>>>>> contributors account > > >>>>>>>>>>>> to > > >>>>>>>>>>>>>>>> find it. > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> This is rather cumbersome. Additionally, it's > > >>>>>> also not > > >>>>>>>>>> equivalent to > > >>>>>>>>>>>>>>>> having a PR build. > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> A normal branch build takes a branch as is and > > >>>>>> tests it. A > > >>>>>>>>>> PR build > > >>>>>>>>>>>>>>>> merges the branch into master, and then runs it. > > >>>>>> (Fun fact: > > >>>>>>>>>> This is > > >>>>>>>>>>>> why > > >>>>>>>>>>>>>>>> a PR without merge conflicts is not being run on > > >>>>>> Travis.) > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> And ultimately, everyone can already make use of > > >>>>> this > > >>>>>>>>>> approach anyway. > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> On 25/06/2019 08:02, Jark Wu wrote: > > >>>>>>>>>>>>>>>>> Hi Jeff, > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> Thanks for sharing the Zeppelin approach. I > > >>>>>> think it's a > > >>>>>>>>>> good idea to > > >>>>>>>>>>>>>>>>> leverage user's travis account. > > >>>>>>>>>>>>>>>>> In this way, we can have almost unlimited > > >>>>>> concurrent build > > >>>>>>>>>> jobs and > > >>>>>>>>>>>>>>>>> developers can restart build by themselves > > >>>>>> (currently only > > >>>>>>>>>> committers > > >>>>>>>>>>>>>>>>> can restart PR's build). > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> But I'm still not very clear how to integrate > > >>>>> user's > > >>>>>>>>>> travis build > > >>>>>>>>>>>> into > > >>>>>>>>>>>>>>>>> the Flink pull request's build automatically. > > >>>>>> Can you > > >>>>>>>>>> explain more in > > >>>>>>>>>>>>>>>>> detail? 
> > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> Another question: does travis only build > > >>>>>> branches for user > > >>>>>>>>>> account? > > >>>>>>>>>>>>>>>>> My concern is that builds for PRs will rebase > > >>>>> user's > > >>>>>>>>>> commits against > > >>>>>>>>>>>>>>>>> current master branch. > > >>>>>>>>>>>>>>>>> This will help us to find problems before > > >>>>>> merge. Builds > > >>>>>>>>>> for branches > > >>>>>>>>>>>>>>>>> will lose the impact of new commits in master. > > >>>>>>>>>>>>>>>>> How does Zeppelin solve this problem? > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> Thanks again for sharing the idea. > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> Regards, > > >>>>>>>>>>>>>>>>> Jark > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> On Tue, 25 Jun 2019 at 11:01, Jeff Zhang > > >>>>>> <[hidden email] <mailto:[hidden email]> > > >>>>>>>>>> <mailto:[hidden email] <mailto:[hidden email]>> > > >>>>>>>>>>>>>>>>> <mailto:[hidden email] > > >>>>>> <mailto:[hidden email]> <mailto:[hidden email] > > >>>>>> <mailto:[hidden email]>>>> wrote: > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> Hi Folks, > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> Zeppelin meet this kind of issue before, we solve > > >>>>>>>>>> it by > > >>>>>>>>>>>> delegating > > >>>>>>>>>>>>>>>>> each > > >>>>>>>>>>>>>>>>> one's PR build to his travis account > > >>>>>> (Everyone can > > >>>>>>>>>> have 5 free > > >>>>>>>>>>>>>>>>> slot for > > >>>>>>>>>>>>>>>>> travis build). > > >>>>>>>>>>>>>>>>> Apache account travis build is only triggered when > > >>>>>>>>>> PR is merged. 
> > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> Kurt Young <[hidden email] > > >>>>>> <mailto:[hidden email]> > > >>>>>>>>>> <mailto:[hidden email] <mailto:[hidden email]>> > > >>>>>> <mailto:[hidden email] <mailto:[hidden email]> > > >>>>>>>>>> <mailto:[hidden email] <mailto:[hidden email]>>>> > > >>>>>>>>>>>>>>>>> 于2019年6月25日周二 上午10:16写道: > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>> (Forgot to cc George) > > >>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>> Best, > > >>>>>>>>>>>>>>>>>> Kurt > > >>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>> On Tue, Jun 25, 2019 at 10:16 AM Kurt Young > > >>>>>>>>>> <[hidden email] <mailto:[hidden email]> > > >>>>>> <mailto:[hidden email] <mailto:[hidden email]>> > > >>>>>>>>>>>>>>>>> <mailto:[hidden email] > > >>>>>> <mailto:[hidden email]> <mailto:[hidden email] > > >>>>>> <mailto:[hidden email]>>>> > > >>>>>>>>>> wrote: > > >>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>> Hi Bowen, > > >>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>> Thanks for bringing this up. We > > >>>>>> actually have > > >>>>>>>>>> discussed > > >>>>>>>>>>>> about > > >>>>>>>>>>>>>>>>> this, and I > > >>>>>>>>>>>>>>>>>>> think Till and George have > > >>>>>>>>>>>>>>>>>>> already spend sometime investigating > > >>>>>> it. I have > > >>>>>>>>>> cced both of > > >>>>>>>>>>>>>>>>> them, and > > >>>>>>>>>>>>>>>>>>> maybe they can share > > >>>>>>>>>>>>>>>>>>> their findings. 
> > >>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>> Best, > > >>>>>>>>>>>>>>>>>>> Kurt > > >>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>> On Tue, Jun 25, 2019 at 10:08 AM Jark Wu > > >>>>>>>>>> <[hidden email] <mailto:[hidden email]> > > >>>>>> <mailto:[hidden email] <mailto:[hidden email]>> > > >>>>>>>>>>>>>>>>> <mailto:[hidden email] > > >>>>>> <mailto:[hidden email]> <mailto:[hidden email] > > >>>>>> <mailto:[hidden email]>>>> > > >>>>>>>>>> wrote: > > >>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> Hi Bowen, > > >>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> Thanks for bringing this. We also > > >>>>>> suffered from > > >>>>>>>>>> the long > > >>>>>>>>>>>>>>>>> build time. > > >>>>>>>>>>>>>>>>>>>> I agree that we should focus on > > >>>>>> solving build > > >>>>>>>>>> capacity > > >>>>>>>>>>>>>>>>> problem in the > > >>>>>>>>>>>>>>>>>>>> thread. > > >>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> My observation is there is only one > > >>>>>> build is > > >>>>>>>>>> running, all > > >>>>>>>>>>>> the > > >>>>>>>>>>>>>>>>> others > > >>>>>>>>>>>>>>>>>>>> (other > > >>>>>>>>>>>>>>>>>>>> PRs, master) are pending. > > >>>>>>>>>>>>>>>>>>>> The pricing plan[1] of travis shows > > >>>>>> it can > > >>>>>>>>>> support > > >>>>>>>>>>>> concurrent > > >>>>>>>>>>>>>>>>> build > > >>>>>>>>>>>>>>>>>> jobs. > > >>>>>>>>>>>>>>>>>>>> But I don't know which plan we are > > >>>>>> using, might > > >>>>>>>>>> be the free > > >>>>>>>>>>>>>>>>> plan for > > >>>>>>>>>>>>>>>>>> open > > >>>>>>>>>>>>>>>>>>>> source. > > >>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> I cc-ed Chesnay who may have some > > >>>>>> experience on > > >>>>>>>>>> Travis. 
> > >>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> Regards, > > >>>>>>>>>>>>>>>>>>>> Jark > > >>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> [1]: https://travis-ci.com/plans > > >>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> On Tue, 25 Jun 2019 at 08:11, Bowen Li < > > >>>>>>>>>>>> [hidden email] <mailto:[hidden email]> > > >>>>>> <mailto:[hidden email] <mailto:[hidden email]>> > > >>>>>>>>>>>>>>>>> <mailto:[hidden email] > > >>>>>> <mailto:[hidden email]> > > >>>>>>>>>> <mailto:[hidden email] > > >>>>>> <mailto:[hidden email]>>>> wrote: > > >>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> Hi Steven, > > >>>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> I think you may not read what I > > >>>>>> wrote. The > > >>>>>>>>>> discussion is > > >>>>>>>>>>>>>> about > > >>>>>>>>>>>>>>>>>> "unstable > > >>>>>>>>>>>>>>>>>>>>> build **capacity**", in another word > > >>>>>>>>>> "unstable / lack of > > >>>>>>>>>>>>>> build > > >>>>>>>>>>>>>>>>>>>> resources", > > >>>>>>>>>>>>>>>>>>>>> not "unstable build". > > >>>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> On Mon, Jun 24, 2019 at 4:40 PM > > >>>>>> Steven Wu > > >>>>>>>>>>>>>>>>> <[hidden email] > > >>>>>> <mailto:[hidden email]> <mailto:[hidden email] > > >>>>>> <mailto:[hidden email]>> > > >>>>>>>>>> <mailto:[hidden email] > > >>>>>> <mailto:[hidden email]> <mailto:[hidden email] > > >>>>>> <mailto:[hidden email]>>>> > > >>>>>>>>>>>>>>>>>> wrote: > > >>>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> long and sometimes unstable build is > > >>>>>>>>>> definitely a pain > > >>>>>>>>>>>>>>>> point. > > >>>>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> I suspect the build failure here in > > >>>>>>>>>>>> flink-connector-kafka > > >>>>>>>>>>>>>>>>> is not > > >>>>>>>>>>>>>>>>>>>> related > > >>>>>>>>>>>>>>>>>>>>> to > > >>>>>>>>>>>>>>>>>>>>>> my change. but there is no easy > > >>>>>> re-run the > > >>>>>>>>>> build on > > >>>>>>>>>>>>>>>>> travis UI. 
> > >>>>>>>>>>>>>>>>>>>>>> search showed a trick of > > >>>>>> close-and-open the > > >>>>>>>>>> PR will > > >>>>>>>>>>>>>>>>> trigger rebuild. > > >>>>>>>>>>>>>>>>>>>> but > > >>>>>>>>>>>>>>>>>>>>>> that could add noises to the PR > > >>>>>> activities. > > >>>>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>> https://travis-ci.org/apache/flink/jobs/545555519 > > >>>>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> travis-ci for my personal repo > > >>>>>> often failed > > >>>>>>>>>> with > > >>>>>>>>>>>>>>>>> exceeding time > > >>>>>>>>>>>>>>>>>> limit > > >>>>>>>>>>>>>>>>>>>>> after > > >>>>>>>>>>>>>>>>>>>>>> 4+ hours. > > >>>>>>>>>>>>>>>>>>>>>> The job exceeded the maximum time > > >>>>>> limit for > > >>>>>>>>>> jobs, and > > >>>>>>>>>>>> has > > >>>>>>>>>>>>>>>>> been > > >>>>>>>>>>>>>>>>>>>>> terminated. > > >>>>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> On Mon, Jun 24, 2019 at 4:15 PM > > >>>>>> Bowen Li > > >>>>>>>>>>>>>>>>> <[hidden email] > > >>>>>> <mailto:[hidden email]> <mailto:[hidden email] > > >>>>>> <mailto:[hidden email]>> > > >>>>>>>>>> <mailto:[hidden email] <mailto:[hidden email]> > > >>>>>> <mailto:[hidden email] <mailto:[hidden email]>>>> > > >>>>>>>>>>>>>>>>>> wrote: > > >>>>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>> https://travis-ci.org/apache/flink/builds/549681530 > > >>>>>>>>>>>>>>>>> This build > > >>>>>>>>>>>>>>>>>>>>> request > > >>>>>>>>>>>>>>>>>>>>>>> has > > >>>>>>>>>>>>>>>>>>>>>>> been sitting at **HEAD of the > > >>>>>> queue** > > >>>>>>>>>> since I first > > >>>>>>>>>>>> saw > > >>>>>>>>>>>>>>>>> it at PST > > >>>>>>>>>>>>>>>>>>>>> 10:30am > > >>>>>>>>>>>>>>>>>>>>>>> (not sure how long it's been > > >>>>>> there before > > >>>>>>>>>> 10:30am). > > >>>>>>>>>>>>>>>>> It's PST > > >>>>>>>>>>>>>>>>>> 4:12pm > > >>>>>>>>>>>>>>>>>>>> now > > >>>>>>>>>>>>>>>>>>>>>> and > > >>>>>>>>>>>>>>>>>>>>>>> it hasn't started yet. 
> > >>>>>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> On Mon, Jun 24, 2019 at 2:48 PM > > >>>>>> Bowen Li > > >>>>>>>>>>>>>>>>> <[hidden email] > > >>>>>> <mailto:[hidden email]> <mailto:[hidden email] > > >>>>>> <mailto:[hidden email]>> > > >>>>>>>>>> <mailto:[hidden email] <mailto:[hidden email]> > > >>>>>> <mailto:[hidden email] <mailto:[hidden email]>>>> > > >>>>>>>>>>>>>>>>>>>> wrote: > > >>>>>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> Hi devs, > > >>>>>>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> I've been experiencing the pain > > >>>>>>>>>> resulting from lack > > >>>>>>>>>>>>>>>>> of stable > > >>>>>>>>>>>>>>>>>>>> build > > >>>>>>>>>>>>>>>>>>>>>>>> capacity on Travis for Flink > > >>>>>> PRs [1]. > > >>>>>>>>>>>> Specifically, I > > >>>>>>>>>>>>>>>>> noticed > > >>>>>>>>>>>>>>>>>>>> often > > >>>>>>>>>>>>>>>>>>>>>> that > > >>>>>>>>>>>>>>>>>>>>>>> no > > >>>>>>>>>>>>>>>>>>>>>>>> build in the queue is making any > > >>>>>>>>>> progress for > > >>>>>>>>>>>> hours, > > >>>>>>>>>>>>>> and > > >>>>>>>>>>>>>>>>>> suddenly > > >>>>>>>>>>>>>>>>>>>> 5 > > >>>>>>>>>>>>>>>>>>>>> or > > >>>>>>>>>>>>>>>>>>>>>> 6 > > >>>>>>>>>>>>>>>>>>>>>>>> builds kick off all together > > >>>>>> after the > > >>>>>>>>>> long pause. > > >>>>>>>>>>>>>>>>> I'm at PST > > >>>>>>>>>>>>>>>>>>>>> (UTC-08) > > >>>>>>>>>>>>>>>>>>>>>>> time > > >>>>>>>>>>>>>>>>>>>>>>>> zone, and I've seen pause can > > >>>>>> be as > > >>>>>>>>>> long as 6 hours > > >>>>>>>>>>>>>>>>> from PST 9am > > >>>>>>>>>>>>>>>>>>>> to > > >>>>>>>>>>>>>>>>>>>>> 3pm > > >>>>>>>>>>>>>>>>>>>>>>>> (let alone the time needed to > > >>>>>> drain the > > >>>>>>>>>> queue > > >>>>>>>>>>>>>>>>> afterwards). > > >>>>>>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> I think this has greatly > > >>>>>> impacted our > > >>>>>>>>>> productivity. 
> > >>>>>>>>>>>>>> I've > > >>>>>>>>>>>>>>>>>>>> experienced > > >>>>>>>>>>>>>>>>>>>>>> that > > >>>>>>>>>>>>>>>>>>>>>>>> PRs submitted in the early > > >>>>>> morning of > > >>>>>>>>>> PST time zone > > >>>>>>>>>>>>>>>>> won't finish > > >>>>>>>>>>>>>>>>>>>>> their > > >>>>>>>>>>>>>>>>>>>>>>>> build until late night of the > > >>>>>> same day. > > >>>>>>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> So my questions are: > > >>>>>>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> - Has anyone else experienced > > >>>>>> the same > > >>>>>>>>>> problem or > > >>>>>>>>>>>>>>>>> have similar > > >>>>>>>>>>>>>>>>>>>>>>> observation > > >>>>>>>>>>>>>>>>>>>>>>>> on TravisCI? (I suspect it > > >>>>>> has things > > >>>>>>>>>> to do with > > >>>>>>>>>>>> time > > >>>>>>>>>>>>>>>>> zone) > > >>>>>>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> - What pricing plan of > > >>>>>> TravisCI is > > >>>>>>>>>> Flink currently > > >>>>>>>>>>>>>>>>> using? Is it > > >>>>>>>>>>>>>>>>>>>> the > > >>>>>>>>>>>>>>>>>>>>>> free > > >>>>>>>>>>>>>>>>>>>>>>>> plan for open source > > >>>>>> projects? What > > >>>>>>>>>> are the > > >>>>>>>>>>>>>>>>> guaranteed build > > >>>>>>>>>>>>>>>>>>>> capacity > > >>>>>>>>>>>>>>>>>>>>>> of > > >>>>>>>>>>>>>>>>>>>>>>>> the current plan? > > >>>>>>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> - If the current pricing plan > > >>>>>> (either > > >>>>>>>>>> free or paid) > > >>>>>>>>>>>>>>>> can't > > >>>>>>>>>>>>>>>>>> provide > > >>>>>>>>>>>>>>>>>>>>>> stable > > >>>>>>>>>>>>>>>>>>>>>>>> build capacity, can we > > >>>>>> upgrade to a > > >>>>>>>>>> higher priced > > >>>>>>>>>>>>>>>>> plan with > > >>>>>>>>>>>>>>>>>> larger > > >>>>>>>>>>>>>>>>>>>>> and > > >>>>>>>>>>>>>>>>>>>>>>> more > > >>>>>>>>>>>>>>>>>>>>>>>> stable build capacity? 
> > >>>>>>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> BTW, another factor that > > >>>>>> contribute to > > >>>>>>>>>> the > > >>>>>>>>>>>>>>>>> productivity problem > > >>>>>>>>>>>>>>>>>> is > > >>>>>>>>>>>>>>>>>>>>> that > > >>>>>>>>>>>>>>>>>>>>>>>> our build is slow - we run > > >>>>>> full build > > >>>>>>>>>> for every PR > > >>>>>>>>>>>>>> and a > > >>>>>>>>>>>>>>>>>>>> successful > > >>>>>>>>>>>>>>>>>>>>>> full > > >>>>>>>>>>>>>>>>>>>>>>>> build takes ~5h. We > > >>>>>> definitely have > > >>>>>>>>>> more options to > > >>>>>>>>>>>>>>>>> solve it, > > >>>>>>>>>>>>>>>>>> for > > >>>>>>>>>>>>>>>>>>>>>>> instance, > > >>>>>>>>>>>>>>>>>>>>>>>> modularize the build graphs > > >>>>>> and reuse > > >>>>>>>>>> artifacts > > >>>>>>>>>>>> from > > >>>>>>>>>>>>>> the > > >>>>>>>>>>>>>>>>>> previous > > >>>>>>>>>>>>>>>>>>>>>> build. > > >>>>>>>>>>>>>>>>>>>>>>>> But I think that can be a big > > >>>>>> effort > > >>>>>>>>>> which is much > > >>>>>>>>>>>>>>>>> harder to > > >>>>>>>>>>>>>>>>>>>>> accomplish > > >>>>>>>>>>>>>>>>>>>>>>> in > > >>>>>>>>>>>>>>>>>>>>>>>> a short period of time and > > >>>>>> may deserve > > >>>>>>>>>> its own > > >>>>>>>>>>>>>> separate > > >>>>>>>>>>>>>>>>>>>> discussion. > > >>>>>>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> [1] > > >>>>>>>>>>>> https://travis-ci.org/apache/flink/pull_requests > > >>>>>>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> -- > > >>>>>>>>>>>>>>>>> Best Regards > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> Jeff Zhang > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>> > > >>>>>>>> > > >>>>>> > > >>>>> > > >>>>> > > >>> > > >>> > > > > > > > > > > |
In reply to this post by Chesnay Schepler-3
Note that the Flinkbot approach isn't that trivial either; we can't _just_ trigger builds for a branch in the apache repo, but would first have to clone the branch/PR into a separate repository (one that is owned by the GitHub account that the Travis account would be tied to).

One roadblock after the next showing up...

On 04/07/2019 11:59, Chesnay Schepler wrote:
> Small update with mostly bad news:
>
> INFRA doesn't know whether it is possible, and referred me to Travis support.
> They did point out that it could be problematic with regard to read/write permissions for the repository.
>
> From my own findings /so far/ with a test repo/organization, it does not appear possible to configure the Travis account used for a specific repository.
>
> So yeah, if we go down this route we may have to pimp the Flinkbot to trigger builds through the Travis REST API.
>
> On 04/07/2019 10:46, Chesnay Schepler wrote:
>> I've raised a JIRA <https://issues.apache.org/jira/browse/INFRA-18703> with INFRA to inquire whether it would be possible to switch to a different Travis account, and if so what steps would need to be taken.
>> We need a proper confirmation from INFRA since we are not in full control of the flink repository (for example, we cannot access the settings page).
>>
>> If this is indeed possible, Ververica is willing to sponsor a Travis account for the Flink project.
>> This would provide us with more than enough resources.
>>
>> Since this makes the project more reliant on resources provided by external companies I would like to vote on this.
>>
>> Please vote on this proposal, as follows:
>> [ ] +1, Approve the migration to a Ververica-sponsored Travis account, provided that INFRA approves
>> [ ] -1, Do not approve the migration to a Ververica-sponsored Travis account
>>
>> The vote will be open for at least 24h, and until we have confirmation from INFRA. The voting period may be shorter than the usual 3 days since our current setup is effectively not working.
>>
>> On 04/07/2019 06:51, Bowen Li wrote:
>>> Re: > Are they using their own Travis CI pool, or did they switch to an entirely different CI service?
>>>
>>> I reached out to Wes and Krisztián from the Apache Arrow PMC. They are currently moving away from ASF's Travis to their own in-house metal machines at [1] with a custom CI application at [2]. They've seen significant improvement w.r.t. both much higher performance and basically no resource waiting time, a "night-and-day" difference, quoting Wes.
>>>
>>> Re: > If we can just switch to our own Travis pool, just for our project, then this might be something we can do fairly quickly?
>>>
>>> I believe so, according to [3] and [4]
>>>
>>> [1] https://ci.ursalabs.org/
>>> [2] https://github.com/ursa-labs/ursabot
>>> [3] https://docs.travis-ci.com/user/migrate/open-source-repository-migration
>>> [4] https://docs.travis-ci.com/user/migrate/open-source-on-travis-ci-com
>>>
>>> On Wed, Jul 3, 2019 at 12:01 AM Chesnay Schepler <[hidden email]> wrote:
>>>
>>> Are they using their own Travis CI pool, or did they switch to an entirely different CI service?
>>>
>>> If we can just switch to our own Travis pool, just for our project, then this might be something we can do fairly quickly?
>>>
>>> On 03/07/2019 05:55, Bowen Li wrote:
>>> > I responded in the INFRA ticket [1] that I believe they are using the wrong metric against Flink, and that total build time is a completely different thing than guaranteed build capacity.
>>> >
>>> > My response:
>>> >
>>> > "As mentioned above, since I started paying attention to Flink's build queue a few weeks ago, I (in Seattle) have seen no build kick off during PST daytime on weekdays for Flink. Our teammates in China and Europe have also reported similar observations.
>>> > So we need to evaluate where the large total build time came from: if 1) your number and 2) our observations from three locations that cover pretty much a full day are all true, I **guess** the extra build time most likely came from weekends, when other Apache projects may be idle and Flink just drains its congested queue.
>>> >
>>> > Please be aware that we're not complaining about the lack of resources in general; I'm complaining about the lack of **stable, dedicated** resources. An example of the latter: currently, even if no build is in Flink's queue and I submit a request that becomes the queue head in the PST morning, my build won't even start for 6-8+ hours. That is an absurd amount of waiting time.
>>> >
>>> > That is to say, if ASF INFRA decides to adopt a quota system and grants Flink five DEDICATED servers that run all the time only for Flink, that'll be PERFECT and can totally solve our problem now.
>>> >
>>> > I feel what's missing in ASF INFRA's Travis resource pool is some level of build capacity SLAs and certainty"
>>> >
>>> > Again, I believe these two problems differ in nature: long build time vs. lack of dedicated build resources. That is to say, shortening build time may relieve the situation, or it may not. I'm slightly negative on disabling IT cases for PRs, because the downside is that we risk bugs in PRs that UTs don't catch, which may cost a lot more to fix, especially if they slow down or even block others, but I am open to others' opinions on it.
>>> >
>>> > AFAICT from the INFRA ticket [1], donating to ASF INFRA won't be a feasible way to solve our problem, since INFRA's pool is fully shared and they have no control over, or finer insight into, resource allocation to a specific Apache project. As mentioned in [1], Apache Arrow is moving away from the ASF INFRA Travis pool (they are actually surprised Flink hasn't planned to do so). I know that Spark is on its own build infra. If we all agree on funding our own build infra, I'd be glad to help investigate potential options after releasing 1.9, since I'm super busy with 1.9 now.
>>> >
>>> > [1] https://issues.apache.org/jira/browse/INFRA-18533
>>> >
>>> > On Tue, Jul 2, 2019 at 4:46 AM Chesnay Schepler <[hidden email]> wrote:
>>> >
>>> >> As a short-term stopgap, since we can assume this issue to become much worse in the following days/weeks, we could disable IT cases in PRs and only run them on master.
>>> >>
>>> >> On 02/07/2019 12:03, Chesnay Schepler wrote:
>>> >>> People really have to stop thinking that just because something works for us it is also a good solution.
>>> >>> Also, please remember that our builds run for 2h from start to finish, and not the 14 _minutes_ it takes for Zeppelin.
>>> >>> We are dealing with an entirely different scale here, both in terms of build times and number of builds.
>>> >>>
>>> >>> In this very thread people have been complaining about long queue times for their builds. Surprise, other Apache projects have been suffering the very same thing due to us not controlling our build times. While switching services (be it Jenkins, CircleCI or whatever) will possibly work for us (and these options are actually attractive, like CircleCI's proper support for build artifacts), it will also likely result in us negatively affecting other projects in significant ways.
>>> >>>
>>> >>> Sure, the Jenkins setup has a good user experience for us, at the cost of blocking Jenkins workers for a _lot_ of time. Right now we have 25 PRs in our queue; that's possibly 50h we'd consume of Jenkins resources, and the European contributors haven't even really started yet.
>>> >>>
>>> >>> FYI, the latest INFRA response from INFRA-18533:
>>> >>>
>>> >>> "Our rough metrics shows that Flink used over 5800 hours of build time last month. That is equal to EIGHT servers running 24/7 for the ENTIRE MONTH. EIGHT. nonstop.
>>> >>> When we discovered this last night, we discussed it some and are going to tune down Flink to allow only five executors maximum. We cannot allow Flink to consume so much of a Foundation shared resource."
>>> >>>
>>> >>> So yes, we either
>>> >>> a) have to heavily reduce our CI usage or
>>> >>> b) fund our own, either maintaining it ourselves or donating to Apache.
>>> >>>
>>> >>> On 02/07/2019 05:11, Bowen Li wrote:
>>> >>>> By looking at the git history of the Jenkins script, its core part was finished in March 2017 (with only two minor updates in 2017/2018), so it's been running for over two years now and it feels like the Zeppelin community has been quite happy with it. @Jeff Zhang <[hidden email]> can you share your insights and user experience with the Jenkins+Travis approach?
>>> >>>>
>>> >>>> Things like:
>>> >>>>
>>> >>>> - has the approach completely solved the resource capacity problem for the Zeppelin community? is the Zeppelin community happy with the result?
>>> >>>> - is the whole configuration chain stable enough (e.g. uptime)?
>>> >>>> - how often do you need to maintain the Jenkins infra? how many people are usually involved in maintenance and bug-fixes?
>>> >>>>
>>> >>>> The downside of this approach seems to me to be mostly the maintenance - maintaining the script and the Jenkins infra.
>>> >>>>
>>> >>>> ** Having Our Own Travis-CI.com Account **
>>> >>>>
>>> >>>> Another alternative I've been thinking of is to have our own travis-ci.com account with paid dedicated resources. Note travis-ci.org is the free version and travis-ci.com is the commercial version. We currently use a shared resource pool managed by the ASF INFRA team on travis-ci.org, but we have no control over it - we can't see how it's configured, how many resources are available, how resources are allocated among Apache projects, etc.
>>> >>>> The nice things about having an account on travis-ci.com are:
>>> >>>>
>>> >>>> - relatively low cost with a much better resource guarantee than what we currently have [1]: $249/month with 5 dedicated concurrent jobs, $489/month with 10 concurrent jobs
>>> >>>> - low maintenance work compared to using Jenkins
>>> >>>> - (potentially) no migration cost according to Travis's doc [2] (pending verification)
>>> >>>> - full control over the build capacity/configuration compared to using ASF INFRA's pool
>>> >>>>
>>> >>>> I'd be surprised if we as such a vibrant community cannot find and fund $249*12=$2988 a year in exchange for a much better developer experience and much higher productivity.
>>> >>>>
>>> >>>> [1] https://travis-ci.com/plans
>>> >>>> [2] https://docs.travis-ci.com/user/migrate/open-source-repository-migration
>>> >>>>
>>> >>>> On Sat, Jun 29, 2019 at 8:39 AM Chesnay Schepler <[hidden email]> wrote:
>>> >>>>
>>> >>>> So yes, the Jenkins job keeps pulling the state from Travis until it finishes.
>>> >>>>
>>> >>>> Not sure I'm comfortable with the idea of using Jenkins workers just to idle for several hours.
>>> >>>>
>>> >>>> On 29/06/2019 14:56, Jeff Zhang wrote:
>>> >>>> > Here's what the Zeppelin community did: we made a Python script to check the build status of a pull request.
>>> >>>> > Here's the script:
>>> >>>> > https://github.com/apache/zeppelin/blob/master/travis_check.py
>>> >>>> >
>>> >>>> > And this is the script we used in the Jenkins build job.
>>> >>>> >
>>> >>>> > if [ -f "travis_check.py" ]; then
>>> >>>> >   git log -n 1
>>> >>>> >   STATUS=$(curl -s $BUILD_URL | grep -e "GitHub pull request.*from.*" |
>>> >>>> >     sed 's/.*GitHub pull request <a href=\"\(https[^"]*\).*from[^"]*.\(https[^"]*\).*/\1 \2/g')
>>> >>>> >   AUTHOR=$(echo $STATUS | sed 's/.*[/]\(.*\)$/\1/g')
>>> >>>> >   PR=$(echo $STATUS | awk '{print $1}' | sed 's/.*[/]\(.*\)$/\1/g')
>>> >>>> >   #COMMIT=$(git log -n 1 | grep "^Merge:" | awk '{print $3}')
>>> >>>> >   #if [ -z $COMMIT ]; then
>>> >>>> >   #  COMMIT=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR |
>>> >>>> >   #    grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | tr '\n' ' ' |
>>> >>>> >   #    sed 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' | grep -v "apache:" |
>>> >>>> >   #    sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g')
>>> >>>> >   #fi
>>> >>>> >
>>> >>>> >   # get commit hash from PR
>>> >>>> >   COMMIT=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR |
>>> >>>> >     grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | tr '\n' ' ' |
>>> >>>> >     sed 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' | grep -v "apache:" |
>>> >>>> >     sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g')
>>> >>>> >   sleep 30 # wait a moment for travis to start the build
>>> >>>> >   RET_CODE=0
>>> >>>> >   python ./travis_check.py ${AUTHOR} ${COMMIT} || RET_CODE=$?
>>> >>>> >   if [ $RET_CODE -eq 2 ]; then # try with repository name when travis-ci is not available in the account
>>> >>>> >     RET_CODE=0
>>> >>>> >     AUTHOR=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR |
>>> >>>> >       grep '"full_name":' | grep -v "apache/zeppelin" |
>>> >>>> >       sed 's/.*[:][^"]*["]\([^/]*\).*/\1/g')
>>> >>>> >     python ./travis_check.py ${AUTHOR} ${COMMIT} || RET_CODE=$?
>>> >>>> >   fi
>>> >>>> >
>>> >>>> >   if [ $RET_CODE -eq 2 ]; then # fail: can't find build information in travis
>>> >>>> >     set +x
>>> >>>> >     echo "-----------------------------------------------------"
>>> >>>> >     echo "Looks like travis-ci is not configured for your fork."
>>> >>>> >     echo "Please setup by switching on 'zeppelin' repository at https://travis-ci.org/profile and travis-ci."
>>> >>>> >     echo "And then make sure 'Build branch updates' option is enabled in the settings https://travis-ci.org/${AUTHOR}/zeppelin/settings."
>>> >>>> >     echo ""
>>> >>>> >     echo "To trigger CI after setup, you will need to amend your last commit with"
>>> >>>> >     echo "git commit --amend"
>>> >>>> >     echo "git push your-remote HEAD --force"
>>> >>>> >     echo ""
>>> >>>> >     echo "See http://zeppelin.apache.org/contribution/contributions.html#continuous-integration."
>>> >>>> >   fi
>>> >>>> >
>>> >>>> >   exit $RET_CODE
>>> >>>> > else
>>> >>>> >   set +x
>>> >>>> >   echo "travis_check.py does not exist"
>>> >>>> >   exit 1
>>> >>>> > fi
>>> >>>> >
>>> >>>> > Chesnay Schepler <[hidden email]> wrote on Sat, Jun 29, 2019 at 3:17 PM:
>>> >>>> >
>>> >>>> >> Does this imply that a Jenkins job is active as long as the Travis build
>>> >>>> >> runs?
>>> >>>> >> >>> >>>> >> On 26/06/2019 21:28, Bowen Li wrote: >>> >>>> >>> Hi, >>> >>>> >>> >>> >>>> >>> @Dawid, I think the "long test running" as I >>> mentioned in the >>> >>>> first >>> >>>> >> email, >>> >>>> >>> also as you guys said, belongs to "a big effort >>> which is much >>> >>>> harder to >>> >>>> >>> accomplish in a short period of time and may deserve >>> its own >>> >>>> separate >>> >>>> >>> discussion". Thus I didn't include it in what we can >>> do in a >>> >>>> foreseeable >>> >>>> >>> short term. >>> >>>> >>> >>> >>>> >>> Besides, I don't think that's the ultimate reason >>> for lack of >>> >>>> build >>> >>>> >>> resources. Even if the build is shortened to >>> something like >>> >>>> 2h, the >>> >>>> >>> problems of no build machine works about 6 or more >>> hours in >>> >>>> PST daytime >>> >>>> >>> that I described will still happen, because no >>> machine from >>> >>>> ASF INFRA's >>> >>>> >>> pool is allocated to Flink. As I have paid close >>> attention to >>> >>>> the build >>> >>>> >>> queue in the past few weekdays, it's a pretty clear >>> pattern now. >>> >>>> >>> >>> >>>> >>> **The ultimate root cause** for that is - we don't >>> have any >>> >>>> **dedicated** >>> >>>> >>> build resources that we can stably rely on. I'm >>> actually ok to >>> >>>> wait for a >>> >>>> >>> long time if there are build requests running, it >>> means at >>> >>>> least we are >>> >>>> >>> making progress. But I'm not ok with no build >>> resource. A >>> >>>> better place I >>> >>>> >>> think we should aim at in short term is to always >>> have at >>> >>>> least a central >>> >>>> >>> pool (can be 3 or 5) of machines dedicated to build >>> Flink at >>> >>>> any time, or >>> >>>> >>> maybe use users resources. 
>>> >>>> >>> >>> >>>> >>> @Chesnay @Robert I synced with Jeff offline that >>> Zeppelin >>> >>>> community is >>> >>>> >>> using a Jenkins job to automatically build on users' >>> travis >>> >>>> account and >>> >>>> >>> link the result back to github PR. I guess the >>> Jenkins job >>> >>>> would fetch >>> >>>> >>> latest upstream master and build the PR against it. >>> Jeff has >>> >>>> filed >>> >>>> >> tickets >>> >>>> >>> to learn and get access to the Jenkins infra. It'll >>> better to >>> >>>> fully >>> >>>> >>> understand it first before judging this approach. >>> >>>> >>> >>> >>>> >>> I also heard good things about CircleCI, and ASF >>> INFRA seems >>> >>>> to have a >>> >>>> >> pool >>> >>>> >>> of build capacity there too. Can be an alternative >>> to consider. >>> >>>> >>> >>> >>>> >>> >>> >>>> >>> >>> >>>> >>> >>> >>>> >>> >>> >>>> >>> >>> >>>> >>> >>> >>>> >>> >>> >>>> >>> >>> >>>> >>> On Wed, Jun 26, 2019 at 12:44 AM Dawid Wysakowicz < >>> >>>> >> [hidden email] >>> <mailto:[hidden email]> <mailto:[hidden email] >>> <mailto:[hidden email]>>> >>> >>>> >>> wrote: >>> >>>> >>> >>> >>>> >>>> Sorry to jump in late, but I think Bowen missed the >>> most >>> >>>> important point >>> >>>> >>>> from Chesnay's previous message in the summary. The >>> ultimate >>> >>>> reason for >>> >>>> >>>> all the problems is that the tests take close to 2 >>> hours to >>> >>>> run already. >>> >>>> >>>> I fully support this claim: "Unless people start >>> caring about >>> >>>> test times >>> >>>> >>>> before adding them, this issue cannot be solved" >>> >>>> >>>> >>> >>>> >>>> This is also another reason why using user's Travis >>> account >>> >>>> won't help. >>> >>>> >>>> Every few weeks we reach the user's time limit for >>> a single >>> >>>> profile. 
>>> >>>> >>>> This makes the user's builds simply fail, until we >>> either >>> >>>> properly >>> >>>> >>>> decrease the time the tests take (which I am not >>> sure we ever >>> >>>> did) or >>> >>>> >>>> postpone the problem by splitting into more >>> profiles. (Note >>> >>>> that the ASF >>> >>>> >>>> Travis account has higher time limits) >>> >>>> >>>> >>> >>>> >>>> Best, >>> >>>> >>>> >>> >>>> >>>> Dawid >>> >>>> >>>> >>> >>>> >>>> On 26/06/2019 09:36, Robert Metzger wrote: >>> >>>> >>>>> Do we know if using "the best" available hardware >>> would >>> >>>> improve the >>> >>>> >> build >>> >>>> >>>>> times? >>> >>>> >>>>> Imagine we would run the build on machines with >>> plenty of >>> >>>> main memory >>> >>>> >> to >>> >>>> >>>>> mount everything to ramdisk + the latest CPU >>> architecture? >>> >>>> >>>>> >>> >>>> >>>>> Throwing hardware at the problem could help reduce >>> the time >>> >>>> of an >>> >>>> >>>>> individual build, and using our own infrastructure >>> would >>> >>>> remove our >>> >>>> >>>>> dependency on Apache's Travis account (with the >>> obvious >>> >>>> downside of >>> >>>> >>>> having >>> >>>> >>>>> to maintain the infrastructure) >>> >>>> >>>>> We could use an open source travis alternative, to >>> have a >>> >>>> similar >>> >>>> >>>>> experience and make the migration easy. >>> >>>> >>>>> >>> >>>> >>>>> >>> >>>> >>>>> On Wed, Jun 26, 2019 at 9:34 AM Chesnay Schepler >>> >>>> <[hidden email] <mailto:[hidden email]> >>> <mailto:[hidden email] <mailto:[hidden email]>>> >>> >>>> >>>> wrote: >>> >>>> >>>>>> >From what I gathered, there's no special >>> sauce that the >>> >>>> Zeppelin >>> >>>> >>>>>> project uses which actually integrates a users >>> Travis >>> >>>> account into the >>> >>>> >>>> PR. >>> >>>> >>>>>> They just disabled Travis for PRs. And that's >>> kind of it. 
>>> >>>> >>>>>> >>> >>>> >>>>>> Naturally we can do this (duh) and safe the ASF a >>> fair >>> >>>> amount of >>> >>>> >>>>>> resources, but there are downsides: >>> >>>> >>>>>> >>> >>>> >>>>>> The discoverability of the Travis check takes a >>> nose-dive. >>> >>>> Either we >>> >>>> >>>>>> require every contributor to always, an every >>> commit, also >>> >>>> post a >>> >>>> >> Travis >>> >>>> >>>>>> build, or we have the reviewer sift through the >>> >>>> contributors account >>> >>>> >> to >>> >>>> >>>>>> find it. >>> >>>> >>>>>> >>> >>>> >>>>>> This is rather cumbersome. Additionally, it's >>> also not >>> >>>> equivalent to >>> >>>> >>>>>> having a PR build. >>> >>>> >>>>>> >>> >>>> >>>>>> A normal branch build takes a branch as is and >>> tests it. A >>> >>>> PR build >>> >>>> >>>>>> merges the branch into master, and then runs it. >>> (Fun fact: >>> >>>> This is >>> >>>> >> why >>> >>>> >>>>>> a PR without merge conflicts is not being run on >>> Travis.) >>> >>>> >>>>>> >>> >>>> >>>>>> And ultimately, everyone can already make use >>> of this >>> >>>> approach anyway. >>> >>>> >>>>>> >>> >>>> >>>>>> On 25/06/2019 08:02, Jark Wu wrote: >>> >>>> >>>>>>> Hi Jeff, >>> >>>> >>>>>>> >>> >>>> >>>>>>> Thanks for sharing the Zeppelin approach. I >>> think it's a >>> >>>> good idea to >>> >>>> >>>>>>> leverage user's travis account. >>> >>>> >>>>>>> In this way, we can have almost unlimited >>> concurrent build >>> >>>> jobs and >>> >>>> >>>>>>> developers can restart build by themselves >>> (currently only >>> >>>> committers >>> >>>> >>>>>>> can restart PR's build). >>> >>>> >>>>>>> >>> >>>> >>>>>>> But I'm still not very clear how to integrate >>> user's >>> >>>> travis build >>> >>>> >> into >>> >>>> >>>>>>> the Flink pull request's build automatically. >>> Can you >>> >>>> explain more in >>> >>>> >>>>>>> detail? >>> >>>> >>>>>>> >>> >>>> >>>>>>> Another question: does travis only build >>> branches for user >>> >>>> account? 
>>> >>>> >>>>>>> My concern is that builds for PRs will rebase >>> user's >>> >>>> commits against >>> >>>> >>>>>>> current master branch. >>> >>>> >>>>>>> This will help us to find problems before >>> merge. Builds >>> >>>> for branches >>> >>>> >>>>>>> will lose the impact of new commits in master. >>> >>>> >>>>>>> How does Zeppelin solve this problem? >>> >>>> >>>>>>> >>> >>>> >>>>>>> Thanks again for sharing the idea. >>> >>>> >>>>>>> >>> >>>> >>>>>>> Regards, >>> >>>> >>>>>>> Jark >>> >>>> >>>>>>> >>> >>>> >>>>>>> On Tue, 25 Jun 2019 at 11:01, Jeff Zhang >>> <[hidden email] <mailto:[hidden email]> >>> >>>> <mailto:[hidden email] <mailto:[hidden email]>> >>> >>>> >>>>>>> <mailto:[hidden email] >>> <mailto:[hidden email]> <mailto:[hidden email] >>> <mailto:[hidden email]>>>> wrote: >>> >>>> >>>>>>> >>> >>>> >>>>>>> Hi Folks, >>> >>>> >>>>>>> >>> >>>> >>>>>>> Zeppelin meet this kind of issue before, we >>> solve >>> >>>> it by >>> >>>> >> delegating >>> >>>> >>>>>>> each >>> >>>> >>>>>>> one's PR build to his travis account >>> (Everyone can >>> >>>> have 5 free >>> >>>> >>>>>>> slot for >>> >>>> >>>>>>> travis build). >>> >>>> >>>>>>> Apache account travis build is only triggered >>> when >>> >>>> PR is merged. 
>>> >>>> >>>>>>> >>> >>>> >>>>>>> >>> >>>> >>>>>>> >>> >>>> >>>>>>> Kurt Young <[hidden email] >>> <mailto:[hidden email]> >>> >>>> <mailto:[hidden email] <mailto:[hidden email]>> >>> <mailto:[hidden email] <mailto:[hidden email]> >>> >>>> <mailto:[hidden email] <mailto:[hidden email]>>>> >>> >>>> >>>>>>> 于2019年6月25日周二 上午10:16写道: >>> >>>> >>>>>>> >>> >>>> >>>>>>> > (Forgot to cc George) >>> >>>> >>>>>>> > >>> >>>> >>>>>>> > Best, >>> >>>> >>>>>>> > Kurt >>> >>>> >>>>>>> > >>> >>>> >>>>>>> > >>> >>>> >>>>>>> > On Tue, Jun 25, 2019 at 10:16 AM Kurt Young >>> >>>> <[hidden email] <mailto:[hidden email]> >>> <mailto:[hidden email] <mailto:[hidden email]>> >>> >>>> >>>>>>> <mailto:[hidden email] >>> <mailto:[hidden email]> <mailto:[hidden email] >>> <mailto:[hidden email]>>>> >>> >>>> wrote: >>> >>>> >>>>>>> > >>> >>>> >>>>>>> > > Hi Bowen, >>> >>>> >>>>>>> > > >>> >>>> >>>>>>> > > Thanks for bringing this up. We >>> actually have >>> >>>> discussed >>> >>>> >> about >>> >>>> >>>>>>> this, and I >>> >>>> >>>>>>> > > think Till and George have >>> >>>> >>>>>>> > > already spend sometime investigating >>> it. I have >>> >>>> cced both of >>> >>>> >>>>>>> them, and >>> >>>> >>>>>>> > > maybe they can share >>> >>>> >>>>>>> > > their findings. >>> >>>> >>>>>>> > > >>> >>>> >>>>>>> > > Best, >>> >>>> >>>>>>> > > Kurt >>> >>>> >>>>>>> > > >>> >>>> >>>>>>> > > >>> >>>> >>>>>>> > > On Tue, Jun 25, 2019 at 10:08 AM Jark Wu >>> >>>> <[hidden email] <mailto:[hidden email]> >>> <mailto:[hidden email] <mailto:[hidden email]>> >>> >>>> >>>>>>> <mailto:[hidden email] >>> <mailto:[hidden email]> <mailto:[hidden email] >>> <mailto:[hidden email]>>>> >>> >>>> wrote: >>> >>>> >>>>>>> > > >>> >>>> >>>>>>> > >> Hi Bowen, >>> >>>> >>>>>>> > >> >>> >>>> >>>>>>> > >> Thanks for bringing this. We also >>> suffered from >>> >>>> the long >>> >>>> >>>>>>> build time. 
>>> >>>> >>>>>>> > >> I agree that we should focus on >>> solving build >>> >>>> capacity >>> >>>> >>>>>>> problem in the >>> >>>> >>>>>>> > >> thread. >>> >>>> >>>>>>> > >> >>> >>>> >>>>>>> > >> My observation is there is only one >>> build is >>> >>>> running, all >>> >>>> >> the >>> >>>> >>>>>>> others >>> >>>> >>>>>>> > >> (other >>> >>>> >>>>>>> > >> PRs, master) are pending. >>> >>>> >>>>>>> > >> The pricing plan[1] of travis shows >>> it can >>> >>>> support >>> >>>> >> concurrent >>> >>>> >>>>>>> build >>> >>>> >>>>>>> > jobs. >>> >>>> >>>>>>> > >> But I don't know which plan we are >>> using, might >>> >>>> be the free >>> >>>> >>>>>>> plan for >>> >>>> >>>>>>> > open >>> >>>> >>>>>>> > >> source. >>> >>>> >>>>>>> > >> >>> >>>> >>>>>>> > >> I cc-ed Chesnay who may have some >>> experience on >>> >>>> Travis. >>> >>>> >>>>>>> > >> >>> >>>> >>>>>>> > >> Regards, >>> >>>> >>>>>>> > >> Jark >>> >>>> >>>>>>> > >> >>> >>>> >>>>>>> > >> [1]: https://travis-ci.com/plans >>> >>>> >>>>>>> > >> >>> >>>> >>>>>>> > >> On Tue, 25 Jun 2019 at 08:11, Bowen Li < >>> >>>> >> [hidden email] <mailto:[hidden email]> >>> <mailto:[hidden email] <mailto:[hidden email]>> >>> >>>> >>>>>>> <mailto:[hidden email] >>> <mailto:[hidden email]> >>> >>>> <mailto:[hidden email] >>> <mailto:[hidden email]>>>> wrote: >>> >>>> >>>>>>> > >> >>> >>>> >>>>>>> > >> > Hi Steven, >>> >>>> >>>>>>> > >> > >>> >>>> >>>>>>> > >> > I think you may not read what I >>> wrote. The >>> >>>> discussion is >>> >>>> >>>> about >>> >>>> >>>>>>> > "unstable >>> >>>> >>>>>>> > >> > build **capacity**", in another word >>> >>>> "unstable / lack of >>> >>>> >>>> build >>> >>>> >>>>>>> > >> resources", >>> >>>> >>>>>>> > >> > not "unstable build". 
>>> >>>> >>>>>>> > >> > >>> >>>> >>>>>>> > >> > On Mon, Jun 24, 2019 at 4:40 PM >>> Steven Wu >>> >>>> >>>>>>> <[hidden email] >>> <mailto:[hidden email]> <mailto:[hidden email] >>> <mailto:[hidden email]>> >>> >>>> <mailto:[hidden email] >>> <mailto:[hidden email]> <mailto:[hidden email] >>> <mailto:[hidden email]>>>> >>> >>>> >>>>>>> > wrote: >>> >>>> >>>>>>> > >> > >>> >>>> >>>>>>> > >> > > long and sometimes unstable build is >>> >>>> definitely a pain >>> >>>> >>>>>> point. >>> >>>> >>>>>>> > >> > > >>> >>>> >>>>>>> > >> > > I suspect the build failure here in >>> >>>> >> flink-connector-kafka >>> >>>> >>>>>>> is not >>> >>>> >>>>>>> > >> related >>> >>>> >>>>>>> > >> > to >>> >>>> >>>>>>> > >> > > my change. but there is no easy >>> re-run the >>> >>>> build on >>> >>>> >>>>>>> travis UI. >>> >>>> >>>>>>> > >> > > search showed a trick of >>> close-and-open the >>> >>>> PR will >>> >>>> >>>>>>> trigger rebuild. >>> >>>> >>>>>>> > >> but >>> >>>> >>>>>>> > >> > > that could add noises to the PR >>> activities. >>> >>>> >>>>>>> > >> > > >>> >>>> https://travis-ci.org/apache/flink/jobs/545555519 >>> >>>> >>>>>>> > >> > > >>> >>>> >>>>>>> > >> > > travis-ci for my personal repo >>> often failed >>> >>>> with >>> >>>> >>>>>>> exceeding time >>> >>>> >>>>>>> > limit >>> >>>> >>>>>>> > >> > after >>> >>>> >>>>>>> > >> > > 4+ hours. >>> >>>> >>>>>>> > >> > > The job exceeded the maximum time >>> limit for >>> >>>> jobs, and >>> >>>> >> has >>> >>>> >>>>>>> been >>> >>>> >>>>>>> > >> > terminated. 
>>> >>>> >>>>>>> > >> > > >>> >>>> >>>>>>> > >> > > On Mon, Jun 24, 2019 at 4:15 PM >>> Bowen Li >>> >>>> >>>>>>> <[hidden email] >>> <mailto:[hidden email]> <mailto:[hidden email] >>> <mailto:[hidden email]>> >>> >>>> <mailto:[hidden email] <mailto:[hidden email]> >>> <mailto:[hidden email] <mailto:[hidden email]>>>> >>> >>>> >>>>>>> > wrote: >>> >>>> >>>>>>> > >> > > >>> >>>> >>>>>>> > >> > > > >>> >>>> https://travis-ci.org/apache/flink/builds/549681530 >>> >>>> >>>>>>> This build >>> >>>> >>>>>>> > >> > request >>> >>>> >>>>>>> > >> > > > has >>> >>>> >>>>>>> > >> > > > been sitting at **HEAD of the >>> queue** >>> >>>> since I first >>> >>>> >> saw >>> >>>> >>>>>>> it at PST >>> >>>> >>>>>>> > >> > 10:30am >>> >>>> >>>>>>> > >> > > > (not sure how long it's been >>> there before >>> >>>> 10:30am). >>> >>>> >>>>>>> It's PST >>> >>>> >>>>>>> > 4:12pm >>> >>>> >>>>>>> > >> now >>> >>>> >>>>>>> > >> > > and >>> >>>> >>>>>>> > >> > > > it hasn't started yet. >>> >>>> >>>>>>> > >> > > > >>> >>>> >>>>>>> > >> > > > On Mon, Jun 24, 2019 at 2:48 PM >>> Bowen Li >>> >>>> >>>>>>> <[hidden email] >>> <mailto:[hidden email]> <mailto:[hidden email] >>> <mailto:[hidden email]>> >>> >>>> <mailto:[hidden email] <mailto:[hidden email]> >>> <mailto:[hidden email] <mailto:[hidden email]>>>> >>> >>>> >>>>>>> > >> wrote: >>> >>>> >>>>>>> > >> > > > >>> >>>> >>>>>>> > >> > > > > Hi devs, >>> >>>> >>>>>>> > >> > > > > >>> >>>> >>>>>>> > >> > > > > I've been experiencing the pain >>> >>>> resulting from lack >>> >>>> >>>>>>> of stable >>> >>>> >>>>>>> > >> build >>> >>>> >>>>>>> > >> > > > > capacity on Travis for Flink >>> PRs [1]. 
>>> >>>> >> Specifically, I >>> >>>> >>>>>>> noticed >>> >>>> >>>>>>> > >> often >>> >>>> >>>>>>> > >> > > that >>> >>>> >>>>>>> > >> > > > no >>> >>>> >>>>>>> > >> > > > > build in the queue is making any >>> >>>> progress for >>> >>>> >> hours, >>> >>>> >>>> and >>> >>>> >>>>>>> > suddenly >>> >>>> >>>>>>> > >> 5 >>> >>>> >>>>>>> > >> > or >>> >>>> >>>>>>> > >> > > 6 >>> >>>> >>>>>>> > >> > > > > builds kick off all together >>> after the >>> >>>> long pause. >>> >>>> >>>>>>> I'm at PST >>> >>>> >>>>>>> > >> > (UTC-08) >>> >>>> >>>>>>> > >> > > > time >>> >>>> >>>>>>> > >> > > > > zone, and I've seen pause can >>> be as >>> >>>> long as 6 hours >>> >>>> >>>>>>> from PST 9am >>> >>>> >>>>>>> > >> to >>> >>>> >>>>>>> > >> > 3pm >>> >>>> >>>>>>> > >> > > > > (let alone the time needed to >>> drain the >>> >>>> queue >>> >>>> >>>>>>> afterwards). >>> >>>> >>>>>>> > >> > > > > >>> >>>> >>>>>>> > >> > > > > I think this has greatly >>> impacted our >>> >>>> productivity. >>> >>>> >>>> I've >>> >>>> >>>>>>> > >> experienced >>> >>>> >>>>>>> > >> > > that >>> >>>> >>>>>>> > >> > > > > PRs submitted in the early >>> morning of >>> >>>> PST time zone >>> >>>> >>>>>>> won't finish >>> >>>> >>>>>>> > >> > their >>> >>>> >>>>>>> > >> > > > > build until late night of the >>> same day. >>> >>>> >>>>>>> > >> > > > > >>> >>>> >>>>>>> > >> > > > > So my questions are: >>> >>>> >>>>>>> > >> > > > > >>> >>>> >>>>>>> > >> > > > > - Has anyone else experienced >>> the same >>> >>>> problem or >>> >>>> >>>>>>> have similar >>> >>>> >>>>>>> > >> > > > observation >>> >>>> >>>>>>> > >> > > > > on TravisCI? (I suspect it >>> has things >>> >>>> to do with >>> >>>> >> time >>> >>>> >>>>>>> zone) >>> >>>> >>>>>>> > >> > > > > >>> >>>> >>>>>>> > >> > > > > - What pricing plan of >>> TravisCI is >>> >>>> Flink currently >>> >>>> >>>>>>> using? Is it >>> >>>> >>>>>>> > >> the >>> >>>> >>>>>>> > >> > > free >>> >>>> >>>>>>> > >> > > > > plan for open source >>> projects? 
What >>> >>>> are the >>> >>>> >>>>>>> guaranteed build >>> >>>> >>>>>>> > >> capacity >>> >>>> >>>>>>> > >> > > of >>> >>>> >>>>>>> > >> > > > > the current plan? >>> >>>> >>>>>>> > >> > > > > >>> >>>> >>>>>>> > >> > > > > - If the current pricing plan >>> (either >>> >>>> free or paid) >>> >>>> >>>>>> can't >>> >>>> >>>>>>> > provide >>> >>>> >>>>>>> > >> > > stable >>> >>>> >>>>>>> > >> > > > > build capacity, can we >>> upgrade to a >>> >>>> higher priced >>> >>>> >>>>>>> plan with >>> >>>> >>>>>>> > larger >>> >>>> >>>>>>> > >> > and >>> >>>> >>>>>>> > >> > > > more >>> >>>> >>>>>>> > >> > > > > stable build capacity? >>> >>>> >>>>>>> > >> > > > > >>> >>>> >>>>>>> > >> > > > > BTW, another factor that >>> contribute to >>> >>>> the >>> >>>> >>>>>>> productivity problem >>> >>>> >>>>>>> > is >>> >>>> >>>>>>> > >> > that >>> >>>> >>>>>>> > >> > > > > our build is slow - we run >>> full build >>> >>>> for every PR >>> >>>> >>>> and a >>> >>>> >>>>>>> > >> successful >>> >>>> >>>>>>> > >> > > full >>> >>>> >>>>>>> > >> > > > > build takes ~5h. We >>> definitely have >>> >>>> more options to >>> >>>> >>>>>>> solve it, >>> >>>> >>>>>>> > for >>> >>>> >>>>>>> > >> > > > instance, >>> >>>> >>>>>>> > >> > > > > modularize the build graphs >>> and reuse >>> >>>> artifacts >>> >>>> >> from >>> >>>> >>>> the >>> >>>> >>>>>>> > previous >>> >>>> >>>>>>> > >> > > build. >>> >>>> >>>>>>> > >> > > > > But I think that can be a big >>> effort >>> >>>> which is much >>> >>>> >>>>>>> harder to >>> >>>> >>>>>>> > >> > accomplish >>> >>>> >>>>>>> > >> > > > in >>> >>>> >>>>>>> > >> > > > > a short period of time and >>> may deserve >>> >>>> its own >>> >>>> >>>> separate >>> >>>> >>>>>>> > >> discussion. 
>>> >>>> >>>>>>> > >> > > > >
>>> >>>> >>>>>>> > >> > > > > [1] https://travis-ci.org/apache/flink/pull_requests
>>> >>>> >>>>>>>
>>> >>>> >>>>>>> --
>>> >>>> >>>>>>> Best Regards
>>> >>>> >>>>>>>
>>> >>>> >>>>>>> Jeff Zhang
+1.
And thanks a lot to Chesnay for pushing this.

Best, Hequn

On Thu, Jul 4, 2019 at 8:07 PM Chesnay Schepler <[hidden email]> wrote:

> Note that the Flinkbot approach isn't that trivial either; we can't
> _just_ trigger builds for a branch in the apache repo, but would first
> have to clone the branch/PR into a separate repository (that is owned by
> the GitHub account that the Travis account would be tied to).
>
> One roadblock after the next showing up...
>
> On 04/07/2019 11:59, Chesnay Schepler wrote:
> > Small update with mostly bad news:
> >
> > INFRA doesn't know whether it is possible, and referred me to Travis
> > support.
> > They did point out that it could be problematic in regards to
> > read/write permissions for the repository.
> >
> > From my own findings /so far/ with a test repo/organization, it does
> > not appear possible to configure the Travis account used for a
> > specific repository.
> >
> > So yeah, if we go down this route we may have to pimp the Flinkbot to
> > trigger builds through the Travis REST API.
> >
> > On 04/07/2019 10:46, Chesnay Schepler wrote:
> >> I've raised a JIRA
> >> <https://issues.apache.org/jira/browse/INFRA-18703> with INFRA to
> >> inquire whether it would be possible to switch to a different Travis
> >> account, and if so what steps would need to be taken.
> >> We need a proper confirmation from INFRA since we are not in full
> >> control of the flink repository (for example, we cannot access the
> >> settings page).
> >>
> >> If this is indeed possible, Ververica is willing to sponsor a Travis
> >> account for the Flink project.
> >> This would provide us with more resources than we need.
> >>
> >> Since this makes the project more reliant on resources provided by
> >> external companies I would like to vote on this.
> >>
> >> Please vote on this proposal, as follows:
> >> [ ] +1, Approve the migration to a Ververica-sponsored Travis
> >> account, provided that INFRA approves
> >> [ ] -1, Do not approve the migration to a Ververica-sponsored Travis
> >> account
> >>
> >> The vote will be open for at least 24h, and until we have
> >> confirmation from INFRA. The voting period may be shorter than the
> >> usual 3 days since our current setup is effectively not working.
> >>
> >> On 04/07/2019 06:51, Bowen Li wrote:
> >>> Re: > Are they using their own Travis CI pool, or did they switch to
> >>> an entirely different CI service?
> >>>
> >>> I reached out to Wes and Krisztián from the Apache Arrow PMC. They are
> >>> currently moving away from ASF's Travis to their own in-house metal
> >>> machines at [1] with a custom CI application at [2]. They've seen
> >>> significant improvements, with both much higher performance and
> >>> basically no resource waiting time: a "night-and-day" difference,
> >>> quoting Wes.
> >>>
> >>> Re: > If we can just switch to our own Travis pool, just for our
> >>> project, then this might be something we can do fairly quickly?
> >>>
> >>> I believe so, according to [3] and [4].
> >>>
> >>> [1] https://ci.ursalabs.org/
> >>> [2] https://github.com/ursa-labs/ursabot
> >>> [3] https://docs.travis-ci.com/user/migrate/open-source-repository-migration
> >>> [4] https://docs.travis-ci.com/user/migrate/open-source-on-travis-ci-com
> >>>
> >>> On Wed, Jul 3, 2019 at 12:01 AM Chesnay Schepler <[hidden email]> wrote:
> >>>
> >>> Are they using their own Travis CI pool, or did they switch to an
> >>> entirely different CI service?
> >>>
> >>> If we can just switch to our own Travis pool, just for our
> >>> project, then this might be something we can do fairly quickly?
> >>>
> >>> On 03/07/2019 05:55, Bowen Li wrote:
> >>> > I responded in the INFRA ticket [1] that I believe they are using a wrong
> >>> > metric against Flink, and that total build time is a completely different
> >>> > thing from guaranteed build capacity.
> >>> >
> >>> > My response:
> >>> >
> >>> > "As mentioned above, since I started to pay attention to Flink's build
> >>> > queue a few tens of days ago, from Seattle, I have seen no builds kick
> >>> > off for Flink during PST daytime on weekdays. Our teammates in China and
> >>> > Europe have also reported similar observations. So we need to evaluate
> >>> > where the large total build time came from. If both 1) your number and
> >>> > 2) our observations from three locations that cover pretty much a full
> >>> > day are true, I **guess** one reason may be that the extra build time
> >>> > most likely came from weekends, when other Apache projects may be idle
> >>> > and Flink just drains its congested queue hard.
> >>> >
> >>> > Please be aware that we're not complaining about the lack of resources
> >>> > in general; I'm complaining about the lack of **stable, dedicated**
> >>> > resources. An example of the latter is that currently, even if no build
> >>> > is in Flink's queue and I submit a request that becomes the queue head
> >>> > in the PST morning, my build won't even start for 6-8+h. That is an
> >>> > absurd amount of waiting time.
> >>> >
> >>> > That is to say, if ASF INFRA decides to adopt a quota system and grants
> >>> > Flink five DEDICATED servers that run all the time only for Flink,
> >>> > that'll be PERFECT and can totally solve our problem now.
> >>> >
> >>> > I feel what's missing in the ASF INFRA's Travis resource pool is some
> >>> > level of build capacity SLAs and certainty"
> >>> >
> >>> >
> >>> > Again, I believe these two problems differ in nature: long build time
> >>> > vs. lack of dedicated build resources. That is to say, shortening build
> >>> > time may relieve the situation, and may not. I'm slightly negative on
> >>> > disabling IT cases for PRs: the downside is that we risk merging bugs
> >>> > that UTs don't catch, which may cost a lot more to fix later, especially
> >>> > if they slow down or even block others. But I am open to others'
> >>> > opinions on it.
> >>> >
> >>> > AFAICT from the INFRA ticket [1], donating to ASF INFRA won't solve our
> >>> > problem, since INFRA's pool is fully shared and they have no control
> >>> > over, or fine-grained insight into, resource allocation to a specific
> >>> > Apache project. As mentioned in [1], Apache Arrow is moving away from the
> >>> > ASF INFRA Travis pool (they are actually surprised Flink hasn't planned
> >>> > to do so). I know that Spark is on its own build infra. If we all agree
> >>> > on funding our own build infra, I'd be glad to help investigate any
> >>> > potential options after releasing 1.9, since I'm super busy with 1.9 now.
> >>> > > >>> > [1] https://issues.apache.org/jira/browse/INFRA-18533 > >>> > > >>> > > >>> > > >>> > On Tue, Jul 2, 2019 at 4:46 AM Chesnay Schepler > >>> <[hidden email] <mailto:[hidden email]>> wrote: > >>> > > >>> >> As a short-term stopgap, since we can assume this issue to > >>> become much > >>> >> worse in the following days/weeks, we could disable IT cases in > >>> PRs and > >>> >> only run them on master. > >>> >> > >>> >> On 02/07/2019 12:03, Chesnay Schepler wrote: > >>> >>> People really have to stop thinking that just because > >>> something works > >>> >>> for us it is also a good solution. > >>> >>> Also, please remember that our builds run for 2h from start to > >>> finish, > >>> >>> and not the 14 _minutes_ it takes for zeppelin. > >>> >>> We are dealing with an entirely different scale here, both in > >>> terms of > >>> >>> build times and number of builds. > >>> >>> > >>> >>> In this very thread people have been complaining about long > >>> queue > >>> >>> times for their builds. Surprise, other Apache projects have > >>> been > >>> >>> suffering the very same thing due to us not controlling our > >>> build > >>> >>> times. While switching services (be it Jenkins, CircleCI or > >>> whatever) > >>> >>> will possibly work for us (and these options are actually > >>> attractive, > >>> >>> like CircleCI's proper support for build artifacts), it will > >>> also > >>> >>> result in us likely negatively affecting other projects in > >>> significant > >>> >>> ways. > >>> >>> > >>> >>> Sure, the Jenkins setup has a good user experience for us, at > >>> the cost > >>> >>> of blocking Jenkins workers for a _lot_ of time. Right now we > >>> have 25 > >>> >>> PR's in our queue; that's possibly 50h we'd consume of Jenkins > >>> >>> resources, and the European contributors haven't even really > >>> started yet. 
> >>> >>> > >>> >>> FYI, the latest INFRA response from INFRA-18533: > >>> >>> > >>> >>> "Our rough metrics shows that Flink used over 5800 hours of > >>> build time > >>> >>> last month. That is equal to EIGHT servers running 24/7 for > >>> the ENTIRE > >>> >>> MONTH. EIGHT. nonstop. > >>> >>> When we discovered this last night, we discussed it some and > >>> are going > >>> >>> to tune down Flink to allow only five executors maximum. We > >>> cannot > >>> >>> allow Flink to consume so much of a Foundation shared > >>> resource." > >>> >>> > >>> >>> So yes, we either > >>> >>> a) have to heavily reduce our CI usage or > >>> >>> b) fund our own, either maintaining it ourselves or donating > >>> to Apache. > >>> >>> > >>> >>> On 02/07/2019 05:11, Bowen Li wrote: > >>> >>>> By looking at the git history of the Jenkins script, its core > >>> part > >>> >>>> was finished in March 2017 (and only two minor update in > >>> 2017/2018), > >>> >>>> so it's been running for over two years now and feels like > >>> Zepplin > >>> >>>> community has been quite happy with it. @Jeff Zhang > >>> >>>> <mailto:[hidden email] <mailto:[hidden email]>> can you > >>> share your insights and user > >>> >>>> experience with the Jenkins+Travis approach? > >>> >>>> > >>> >>>> Things like: > >>> >>>> > >>> >>>> - has the approach completely solved the resource capacity > >>> problem > >>> >>>> for Zepplin community? is Zepplin community happy with the > >>> result? > >>> >>>> - is the whole configuration chain stable (e.g. uptime) > >>> enough? > >>> >>>> - how often do you need to maintain the Jenkins infra? how > >>> many > >>> >>>> people are usually involved in maintenance and bug-fixes? > >>> >>>> > >>> >>>> The downside of this approach seems mostly to be on the > >>> maintenance > >>> >>>> to me - maintain the script and Jenkins infra. 
> >>> >>>> > >>> >>>> ** Having Our Own Travis-CI.com Account ** > >>> >>>> > >>> >>>> Another alternative I've been thinking of is to have our own > >>> >>>> travis-ci.com <http://travis-ci.com> <http://travis-ci.com> > >>> account with paid dedicated > >>> >>>> resources. Note travis-ci.org <http://travis-ci.org> > >>> <http://travis-ci.org> is the free > >>> >>>> version and travis-ci.com <http://travis-ci.com> > >>> <http://travis-ci.com> is the commercial > >>> >>>> version. We currently use a shared resource pool managed by > >>> ASK INFRA > >>> >>>> team on travis-ci.org <http://travis-ci.org> > >>> <http://travis-ci.org>, but we have no control > >>> >>>> over it - we can't see how it's configured, how much > >>> resources are > >>> >>>> available, how resources are allocated among Apache projects, > >>> etc. > >>> >>>> The nice thing about having an account on travis-ci.com > >>> <http://travis-ci.com> > >>> >>>> <http://travis-ci.com> are: > >>> >>>> > >>> >>>> - relatively low cost with much better resource guarantee > >>> than what > >>> >>>> we currently have [1]: $249/month with 5 dedicated > >>> concurrency, > >>> >>>> $489/month with 10 concurrency > >>> >>>> - low maintenance work compared to using Jenkins > >>> >>>> - (potentially) no migration cost according to Travis's doc > >>> [2] > >>> >>>> (pending verification) > >>> >>>> - full control over the build capacity/configuration > >>> compared to > >>> >>>> using ASF INFRA's pool > >>> >>>> > >>> >>>> I'd be surprised if we as such a vibrant community cannot > >>> find and > >>> >>>> fund $249*12=$2988 a year in exchange for a much better > >>> developer > >>> >>>> experience and much higher productivity. 
> >>> >>>> > >>> >>>> [1] https://travis-ci.com/plans > >>> >>>> [2] > >>> >>>> > >>> >> > >>> > https://docs.travis-ci.com/user/migrate/open-source-repository-migration > >>> > >>> >>>> On Sat, Jun 29, 2019 at 8:39 AM Chesnay Schepler > >>> <[hidden email] <mailto:[hidden email]> > >>> >>>> <mailto:[hidden email] <mailto:[hidden email]>>> > >>> wrote: > >>> >>>> > >>> >>>> So yes, the Jenkins job keeps pulling the state from > >>> Travis until it > >>> >>>> finishes. > >>> >>>> > >>> >>>> Note sure I'm comfortable with the idea of using Jenkins > >>> workers > >>> >>>> just to > >>> >>>> idle for a several hours. > >>> >>>> > >>> >>>> On 29/06/2019 14:56, Jeff Zhang wrote: > >>> >>>> > Here's what zeppelin community did, we make a python > >>> script to > >>> >>>> check the > >>> >>>> > build status of pull request. > >>> >>>> > Here's script: > >>> >>>> > > >>> https://github.com/apache/zeppelin/blob/master/travis_check.py > >>> >>>> > > >>> >>>> > And this is the script we used in Jenkins build job. 
> >>> >>>> > > >>> >>>> > if [ -f "travis_check.py" ]; then > >>> >>>> > git log -n 1 > >>> >>>> > STATUS=$(curl -s $BUILD_URL | grep -e "GitHub pull > >>> >>>> request.*from.*" | sed > >>> >>>> > 's/.*GitHub pull request <a > >>> >>>> > href=\"\(https[^"]*\).*from[^"]*.\(https[^"]*\).*/\1 > >>> \2/g') > >>> >>>> > AUTHOR=$(echo $STATUS | sed 's/.*[/]\(.*\)$/\1/g') > >>> >>>> > PR=$(echo $STATUS | awk '{print $1}' | sed > >>> >>>> 's/.*[/]\(.*\)$/\1/g') > >>> >>>> > #COMMIT=$(git log -n 1 | grep "^Merge:" | awk > >>> '{print $3}') > >>> >>>> > #if [ -z $COMMIT ]; then > >>> >>>> > # COMMIT=$(curl -s > >>> >>>> https://api.github.com/repos/apache/zeppelin/pulls/$PR > >>> >>>> > | grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | > >>> tr '\n' ' ' > >>> >>>> | sed > >>> >>>> > 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' | > >>> grep -v > >>> >>>> "apache:" | > >>> >>>> > sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g') > >>> >>>> > #fi > >>> >>>> > > >>> >>>> > # get commit hash from PR > >>> >>>> > COMMIT=$(curl -s > >>> >>>> https://api.github.com/repos/apache/zeppelin/pulls/$PR | > >>> >>>> > grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | tr > >>> '\n' ' ' > >>> >>>> | sed > >>> >>>> > 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' | > >>> grep -v > >>> >>>> "apache:" | > >>> >>>> > sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g') > >>> >>>> > sleep 30 # sleep few moment to wait travis starts > >>> the build > >>> >>>> > RET_CODE=0 > >>> >>>> > python ./travis_check.py ${AUTHOR} ${COMMIT} || > >>> RET_CODE=$? > >>> >>>> > if [ $RET_CODE -eq 2 ]; then # try with repository > >>> name when > >>> >>>> travis-ci is > >>> >>>> > not available in the account > >>> >>>> > RET_CODE=0 > >>> >>>> > AUTHOR=$(curl -s > >>> >>>> https://api.github.com/repos/apache/zeppelin/pulls/$PR > >>> >>>> > | grep '"full_name":' | grep -v "apache/zeppelin" | sed > >>> >>>> > 's/.*[:][^"]*["]\([^/]*\).*/\1/g') > >>> >>>> > python ./travis_check.py ${AUTHOR} ${COMMIT} || > >>> RET_CODE=$? 
> >>> >>>> > fi > >>> >>>> > > >>> >>>> > if [ $RET_CODE -eq 2 ]; then # fail with can't find > >>> build > >>> >>>> information in > >>> >>>> > the travis > >>> >>>> > set +x > >>> >>>> > echo > >>> "-----------------------------------------------------" > >>> >>>> > echo "Looks like travis-ci is not configured for > >>> your fork." > >>> >>>> > echo "Please setup by swich on 'zeppelin' > >>> repository at > >>> >>>> > https://travis-ci.org/profile and travis-ci." > >>> >>>> > echo "And then make sure 'Build branch updates' > >>> option is > >>> >>>> enabled in > >>> >>>> > the settings > >>> https://travis-ci.org/${AUTHOR}/zeppelin/settings > >>> <https://travis-ci.org/$%7BAUTHOR%7D/zeppelin/settings> > >>> >>>> <https://travis-ci.org/$%7BAUTHOR%7D/zeppelin/settings>." > >>> >>>> > echo "" > >>> >>>> > echo "To trigger CI after setup, you will need > >>> ammend your > >>> >>>> last commit > >>> >>>> > with" > >>> >>>> > echo "git commit --amend" > >>> >>>> > echo "git push your-remote HEAD --force" > >>> >>>> > echo "" > >>> >>>> > echo "See > >>> >>>> > > >>> >>>> > >>> >> > >>> > http://zeppelin.apache.org/contribution/contributions.html#continuous-integration > >>> >>>> > ." > >>> >>>> > fi > >>> >>>> > > >>> >>>> > exit $RET_CODE > >>> >>>> > else > >>> >>>> > set +x > >>> >>>> > echo "travis_check.py does not exists" > >>> >>>> > exit 1 > >>> >>>> > fi > >>> >>>> > > >>> >>>> > Chesnay Schepler <[hidden email] > >>> <mailto:[hidden email]> > >>> >>>> <mailto:[hidden email] <mailto:[hidden email]>>> > >>> 于2019年6月29日周六 下午3:17写道: > >>> >>>> > > >>> >>>> >> Does this imply that a Jenkins job is active as long > >>> as the > >>> >>>> Travis build > >>> >>>> >> runs? 
> >>> >>>> >> > >>> >>>> >> On 26/06/2019 21:28, Bowen Li wrote: > >>> >>>> >>> Hi, > >>> >>>> >>> > >>> >>>> >>> @Dawid, I think the "long test running" as I > >>> mentioned in the > >>> >>>> first > >>> >>>> >> email, > >>> >>>> >>> also as you guys said, belongs to "a big effort > >>> which is much > >>> >>>> harder to > >>> >>>> >>> accomplish in a short period of time and may deserve > >>> its own > >>> >>>> separate > >>> >>>> >>> discussion". Thus I didn't include it in what we can > >>> do in a > >>> >>>> foreseeable > >>> >>>> >>> short term. > >>> >>>> >>> > >>> >>>> >>> Besides, I don't think that's the ultimate reason > >>> for lack of > >>> >>>> build > >>> >>>> >>> resources. Even if the build is shortened to > >>> something like > >>> >>>> 2h, the > >>> >>>> >>> problems of no build machine works about 6 or more > >>> hours in > >>> >>>> PST daytime > >>> >>>> >>> that I described will still happen, because no > >>> machine from > >>> >>>> ASF INFRA's > >>> >>>> >>> pool is allocated to Flink. As I have paid close > >>> attention to > >>> >>>> the build > >>> >>>> >>> queue in the past few weekdays, it's a pretty clear > >>> pattern now. > >>> >>>> >>> > >>> >>>> >>> **The ultimate root cause** for that is - we don't > >>> have any > >>> >>>> **dedicated** > >>> >>>> >>> build resources that we can stably rely on. I'm > >>> actually ok to > >>> >>>> wait for a > >>> >>>> >>> long time if there are build requests running, it > >>> means at > >>> >>>> least we are > >>> >>>> >>> making progress. But I'm not ok with no build > >>> resource. A > >>> >>>> better place I > >>> >>>> >>> think we should aim at in short term is to always > >>> have at > >>> >>>> least a central > >>> >>>> >>> pool (can be 3 or 5) of machines dedicated to build > >>> Flink at > >>> >>>> any time, or > >>> >>>> >>> maybe use users resources. 
> >>> >>>> >>> > >>> >>>> >>> @Chesnay @Robert I synced with Jeff offline that > >>> Zeppelin > >>> >>>> community is > >>> >>>> >>> using a Jenkins job to automatically build on users' > >>> travis > >>> >>>> account and > >>> >>>> >>> link the result back to github PR. I guess the > >>> Jenkins job > >>> >>>> would fetch > >>> >>>> >>> latest upstream master and build the PR against it. > >>> Jeff has > >>> >>>> filed > >>> >>>> >> tickets > >>> >>>> >>> to learn and get access to the Jenkins infra. It'll > >>> better to > >>> >>>> fully > >>> >>>> >>> understand it first before judging this approach. > >>> >>>> >>> > >>> >>>> >>> I also heard good things about CircleCI, and ASF > >>> INFRA seems > >>> >>>> to have a > >>> >>>> >> pool > >>> >>>> >>> of build capacity there too. Can be an alternative > >>> to consider. > >>> >>>> >>> > >>> >>>> >>> > >>> >>>> >>> > >>> >>>> >>> > >>> >>>> >>> > >>> >>>> >>> > >>> >>>> >>> > >>> >>>> >>> > >>> >>>> >>> > >>> >>>> >>> On Wed, Jun 26, 2019 at 12:44 AM Dawid Wysakowicz < > >>> >>>> >> [hidden email] > >>> <mailto:[hidden email]> <mailto:[hidden email] > >>> <mailto:[hidden email]>>> > >>> >>>> >>> wrote: > >>> >>>> >>> > >>> >>>> >>>> Sorry to jump in late, but I think Bowen missed the > >>> most > >>> >>>> important point > >>> >>>> >>>> from Chesnay's previous message in the summary. The > >>> ultimate > >>> >>>> reason for > >>> >>>> >>>> all the problems is that the tests take close to 2 > >>> hours to > >>> >>>> run already. > >>> >>>> >>>> I fully support this claim: "Unless people start > >>> caring about > >>> >>>> test times > >>> >>>> >>>> before adding them, this issue cannot be solved" > >>> >>>> >>>> > >>> >>>> >>>> This is also another reason why using user's Travis > >>> account > >>> >>>> won't help. > >>> >>>> >>>> Every few weeks we reach the user's time limit for > >>> a single > >>> >>>> profile. 
> >>> >>>> >>>> This makes the user's builds simply fail, until we > >>> either > >>> >>>> properly > >>> >>>> >>>> decrease the time the tests take (which I am not > >>> sure we ever > >>> >>>> did) or > >>> >>>> >>>> postpone the problem by splitting into more > >>> profiles. (Note > >>> >>>> that the ASF > >>> >>>> >>>> Travis account has higher time limits) > >>> >>>> >>>> > >>> >>>> >>>> Best, > >>> >>>> >>>> > >>> >>>> >>>> Dawid > >>> >>>> >>>> > >>> >>>> >>>> On 26/06/2019 09:36, Robert Metzger wrote: > >>> >>>> >>>>> Do we know if using "the best" available hardware > >>> would > >>> >>>> improve the > >>> >>>> >> build > >>> >>>> >>>>> times? > >>> >>>> >>>>> Imagine we would run the build on machines with > >>> plenty of > >>> >>>> main memory > >>> >>>> >> to > >>> >>>> >>>>> mount everything to ramdisk + the latest CPU > >>> architecture? > >>> >>>> >>>>> > >>> >>>> >>>>> Throwing hardware at the problem could help reduce > >>> the time > >>> >>>> of an > >>> >>>> >>>>> individual build, and using our own infrastructure > >>> would > >>> >>>> remove our > >>> >>>> >>>>> dependency on Apache's Travis account (with the > >>> obvious > >>> >>>> downside of > >>> >>>> >>>> having > >>> >>>> >>>>> to maintain the infrastructure) > >>> >>>> >>>>> We could use an open source travis alternative, to > >>> have a > >>> >>>> similar > >>> >>>> >>>>> experience and make the migration easy. > >>> >>>> >>>>> > >>> >>>> >>>>> > >>> >>>> >>>>> On Wed, Jun 26, 2019 at 9:34 AM Chesnay Schepler > >>> >>>> <[hidden email] <mailto:[hidden email]> > >>> <mailto:[hidden email] <mailto:[hidden email]>>> > >>> >>>> >>>> wrote: > >>> >>>> >>>>>> >From what I gathered, there's no special > >>> sauce that the > >>> >>>> Zeppelin > >>> >>>> >>>>>> project uses which actually integrates a users > >>> Travis > >>> >>>> account into the > >>> >>>> >>>> PR. > >>> >>>> >>>>>> They just disabled Travis for PRs. And that's > >>> kind of it. 
> >>> >>>> >>>>>> > >>> >>>> >>>>>> Naturally we can do this (duh) and safe the ASF a > >>> fair > >>> >>>> amount of > >>> >>>> >>>>>> resources, but there are downsides: > >>> >>>> >>>>>> > >>> >>>> >>>>>> The discoverability of the Travis check takes a > >>> nose-dive. > >>> >>>> Either we > >>> >>>> >>>>>> require every contributor to always, an every > >>> commit, also > >>> >>>> post a > >>> >>>> >> Travis > >>> >>>> >>>>>> build, or we have the reviewer sift through the > >>> >>>> contributors account > >>> >>>> >> to > >>> >>>> >>>>>> find it. > >>> >>>> >>>>>> > >>> >>>> >>>>>> This is rather cumbersome. Additionally, it's > >>> also not > >>> >>>> equivalent to > >>> >>>> >>>>>> having a PR build. > >>> >>>> >>>>>> > >>> >>>> >>>>>> A normal branch build takes a branch as is and > >>> tests it. A > >>> >>>> PR build > >>> >>>> >>>>>> merges the branch into master, and then runs it. > >>> (Fun fact: > >>> >>>> This is > >>> >>>> >> why > >>> >>>> >>>>>> a PR without merge conflicts is not being run on > >>> Travis.) > >>> >>>> >>>>>> > >>> >>>> >>>>>> And ultimately, everyone can already make use > >>> of this > >>> >>>> approach anyway. > >>> >>>> >>>>>> > >>> >>>> >>>>>> On 25/06/2019 08:02, Jark Wu wrote: > >>> >>>> >>>>>>> Hi Jeff, > >>> >>>> >>>>>>> > >>> >>>> >>>>>>> Thanks for sharing the Zeppelin approach. I > >>> think it's a > >>> >>>> good idea to > >>> >>>> >>>>>>> leverage user's travis account. > >>> >>>> >>>>>>> In this way, we can have almost unlimited > >>> concurrent build > >>> >>>> jobs and > >>> >>>> >>>>>>> developers can restart build by themselves > >>> (currently only > >>> >>>> committers > >>> >>>> >>>>>>> can restart PR's build). > >>> >>>> >>>>>>> > >>> >>>> >>>>>>> But I'm still not very clear how to integrate > >>> user's > >>> >>>> travis build > >>> >>>> >> into > >>> >>>> >>>>>>> the Flink pull request's build automatically. > >>> Can you > >>> >>>> explain more in > >>> >>>> >>>>>>> detail? 
> >>> >>>> >>>>>>> > >>> >>>> >>>>>>> Another question: does travis only build > >>> branches for user > >>> >>>> account? > >>> >>>> >>>>>>> My concern is that builds for PRs will rebase > >>> user's > >>> >>>> commits against > >>> >>>> >>>>>>> current master branch. > >>> >>>> >>>>>>> This will help us to find problems before > >>> merge. Builds > >>> >>>> for branches > >>> >>>> >>>>>>> will lose the impact of new commits in master. > >>> >>>> >>>>>>> How does Zeppelin solve this problem? > >>> >>>> >>>>>>> > >>> >>>> >>>>>>> Thanks again for sharing the idea. > >>> >>>> >>>>>>> > >>> >>>> >>>>>>> Regards, > >>> >>>> >>>>>>> Jark > >>> >>>> >>>>>>> > >>> >>>> >>>>>>> On Tue, 25 Jun 2019 at 11:01, Jeff Zhang > >>> <[hidden email] <mailto:[hidden email]> > >>> >>>> <mailto:[hidden email] <mailto:[hidden email]>> > >>> >>>> >>>>>>> <mailto:[hidden email] > >>> <mailto:[hidden email]> <mailto:[hidden email] > >>> <mailto:[hidden email]>>>> wrote: > >>> >>>> >>>>>>> > >>> >>>> >>>>>>> Hi Folks, > >>> >>>> >>>>>>> > >>> >>>> >>>>>>> Zeppelin meet this kind of issue before, we > >>> solve > >>> >>>> it by > >>> >>>> >> delegating > >>> >>>> >>>>>>> each > >>> >>>> >>>>>>> one's PR build to his travis account > >>> (Everyone can > >>> >>>> have 5 free > >>> >>>> >>>>>>> slot for > >>> >>>> >>>>>>> travis build). > >>> >>>> >>>>>>> Apache account travis build is only triggered > >>> when > >>> >>>> PR is merged. 
> >>> >>>> >>>>>>> > >>> >>>> >>>>>>> > >>> >>>> >>>>>>> > >>> >>>> >>>>>>> Kurt Young <[hidden email] > >>> <mailto:[hidden email]> > >>> >>>> <mailto:[hidden email] <mailto:[hidden email]>> > >>> <mailto:[hidden email] <mailto:[hidden email]> > >>> >>>> <mailto:[hidden email] <mailto:[hidden email]>>>> > >>> >>>> >>>>>>> 于2019年6月25日周二 上午10:16写道: > >>> >>>> >>>>>>> > >>> >>>> >>>>>>> > (Forgot to cc George) > >>> >>>> >>>>>>> > > >>> >>>> >>>>>>> > Best, > >>> >>>> >>>>>>> > Kurt > >>> >>>> >>>>>>> > > >>> >>>> >>>>>>> > > >>> >>>> >>>>>>> > On Tue, Jun 25, 2019 at 10:16 AM Kurt Young > >>> >>>> <[hidden email] <mailto:[hidden email]> > >>> <mailto:[hidden email] <mailto:[hidden email]>> > >>> >>>> >>>>>>> <mailto:[hidden email] > >>> <mailto:[hidden email]> <mailto:[hidden email] > >>> <mailto:[hidden email]>>>> > >>> >>>> wrote: > >>> >>>> >>>>>>> > > >>> >>>> >>>>>>> > > Hi Bowen, > >>> >>>> >>>>>>> > > > >>> >>>> >>>>>>> > > Thanks for bringing this up. We > >>> actually have > >>> >>>> discussed > >>> >>>> >> about > >>> >>>> >>>>>>> this, and I > >>> >>>> >>>>>>> > > think Till and George have > >>> >>>> >>>>>>> > > already spend sometime investigating > >>> it. I have > >>> >>>> cced both of > >>> >>>> >>>>>>> them, and > >>> >>>> >>>>>>> > > maybe they can share > >>> >>>> >>>>>>> > > their findings. > >>> >>>> >>>>>>> > > > >>> >>>> >>>>>>> > > Best, > >>> >>>> >>>>>>> > > Kurt > >>> >>>> >>>>>>> > > > >>> >>>> >>>>>>> > > > >>> >>>> >>>>>>> > > On Tue, Jun 25, 2019 at 10:08 AM Jark Wu > >>> >>>> <[hidden email] <mailto:[hidden email]> > >>> <mailto:[hidden email] <mailto:[hidden email]>> > >>> >>>> >>>>>>> <mailto:[hidden email] > >>> <mailto:[hidden email]> <mailto:[hidden email] > >>> <mailto:[hidden email]>>>> > >>> >>>> wrote: > >>> >>>> >>>>>>> > > > >>> >>>> >>>>>>> > >> Hi Bowen, > >>> >>>> >>>>>>> > >> > >>> >>>> >>>>>>> > >> Thanks for bringing this. We also > >>> suffered from > >>> >>>> the long > >>> >>>> >>>>>>> build time. 
> >>> >>>> >>>>>>> > >> I agree that we should focus on > >>> solving build > >>> >>>> capacity > >>> >>>> >>>>>>> problem in the > >>> >>>> >>>>>>> > >> thread. > >>> >>>> >>>>>>> > >> > >>> >>>> >>>>>>> > >> My observation is there is only one > >>> build is > >>> >>>> running, all > >>> >>>> >> the > >>> >>>> >>>>>>> others > >>> >>>> >>>>>>> > >> (other > >>> >>>> >>>>>>> > >> PRs, master) are pending. > >>> >>>> >>>>>>> > >> The pricing plan[1] of travis shows > >>> it can > >>> >>>> support > >>> >>>> >> concurrent > >>> >>>> >>>>>>> build > >>> >>>> >>>>>>> > jobs. > >>> >>>> >>>>>>> > >> But I don't know which plan we are > >>> using, might > >>> >>>> be the free > >>> >>>> >>>>>>> plan for > >>> >>>> >>>>>>> > open > >>> >>>> >>>>>>> > >> source. > >>> >>>> >>>>>>> > >> > >>> >>>> >>>>>>> > >> I cc-ed Chesnay who may have some > >>> experience on > >>> >>>> Travis. > >>> >>>> >>>>>>> > >> > >>> >>>> >>>>>>> > >> Regards, > >>> >>>> >>>>>>> > >> Jark > >>> >>>> >>>>>>> > >> > >>> >>>> >>>>>>> > >> [1]: https://travis-ci.com/plans > >>> >>>> >>>>>>> > >> > >>> >>>> >>>>>>> > >> On Tue, 25 Jun 2019 at 08:11, Bowen Li < > >>> >>>> >> [hidden email] <mailto:[hidden email]> > >>> <mailto:[hidden email] <mailto:[hidden email]>> > >>> >>>> >>>>>>> <mailto:[hidden email] > >>> <mailto:[hidden email]> > >>> >>>> <mailto:[hidden email] > >>> <mailto:[hidden email]>>>> wrote: > >>> >>>> >>>>>>> > >> > >>> >>>> >>>>>>> > >> > Hi Steven, > >>> >>>> >>>>>>> > >> > > >>> >>>> >>>>>>> > >> > I think you may not read what I > >>> wrote. The > >>> >>>> discussion is > >>> >>>> >>>> about > >>> >>>> >>>>>>> > "unstable > >>> >>>> >>>>>>> > >> > build **capacity**", in another word > >>> >>>> "unstable / lack of > >>> >>>> >>>> build > >>> >>>> >>>>>>> > >> resources", > >>> >>>> >>>>>>> > >> > not "unstable build". 
> >>> >>>> >>>>>>> > >> > > >>> >>>> >>>>>>> > >> > On Mon, Jun 24, 2019 at 4:40 PM > >>> Steven Wu > >>> >>>> >>>>>>> <[hidden email] > >>> <mailto:[hidden email]> <mailto:[hidden email] > >>> <mailto:[hidden email]>> > >>> >>>> <mailto:[hidden email] > >>> <mailto:[hidden email]> <mailto:[hidden email] > >>> <mailto:[hidden email]>>>> > >>> >>>> >>>>>>> > wrote: > >>> >>>> >>>>>>> > >> > > >>> >>>> >>>>>>> > >> > > long and sometimes unstable build is > >>> >>>> definitely a pain > >>> >>>> >>>>>> point. > >>> >>>> >>>>>>> > >> > > > >>> >>>> >>>>>>> > >> > > I suspect the build failure here in > >>> >>>> >> flink-connector-kafka > >>> >>>> >>>>>>> is not > >>> >>>> >>>>>>> > >> related > >>> >>>> >>>>>>> > >> > to > >>> >>>> >>>>>>> > >> > > my change. but there is no easy > >>> re-run the > >>> >>>> build on > >>> >>>> >>>>>>> travis UI. > >>> >>>> >>>>>>> > >> > > search showed a trick of > >>> close-and-open the > >>> >>>> PR will > >>> >>>> >>>>>>> trigger rebuild. > >>> >>>> >>>>>>> > >> but > >>> >>>> >>>>>>> > >> > > that could add noises to the PR > >>> activities. > >>> >>>> >>>>>>> > >> > > > >>> >>>> https://travis-ci.org/apache/flink/jobs/545555519 > >>> >>>> >>>>>>> > >> > > > >>> >>>> >>>>>>> > >> > > travis-ci for my personal repo > >>> often failed > >>> >>>> with > >>> >>>> >>>>>>> exceeding time > >>> >>>> >>>>>>> > limit > >>> >>>> >>>>>>> > >> > after > >>> >>>> >>>>>>> > >> > > 4+ hours. > >>> >>>> >>>>>>> > >> > > The job exceeded the maximum time > >>> limit for > >>> >>>> jobs, and > >>> >>>> >> has > >>> >>>> >>>>>>> been > >>> >>>> >>>>>>> > >> > terminated. 
> >>> >>>> >>>>>>> > >> > > > >>> >>>> >>>>>>> > >> > > On Mon, Jun 24, 2019 at 4:15 PM > >>> Bowen Li > >>> >>>> >>>>>>> <[hidden email] > >>> <mailto:[hidden email]> <mailto:[hidden email] > >>> <mailto:[hidden email]>> > >>> >>>> <mailto:[hidden email] <mailto:[hidden email]> > >>> <mailto:[hidden email] <mailto:[hidden email]>>>> > >>> >>>> >>>>>>> > wrote: > >>> >>>> >>>>>>> > >> > > > >>> >>>> >>>>>>> > >> > > > > >>> >>>> https://travis-ci.org/apache/flink/builds/549681530 > >>> >>>> >>>>>>> This build > >>> >>>> >>>>>>> > >> > request > >>> >>>> >>>>>>> > >> > > > has > >>> >>>> >>>>>>> > >> > > > been sitting at **HEAD of the > >>> queue** > >>> >>>> since I first > >>> >>>> >> saw > >>> >>>> >>>>>>> it at PST > >>> >>>> >>>>>>> > >> > 10:30am > >>> >>>> >>>>>>> > >> > > > (not sure how long it's been > >>> there before > >>> >>>> 10:30am). > >>> >>>> >>>>>>> It's PST > >>> >>>> >>>>>>> > 4:12pm > >>> >>>> >>>>>>> > >> now > >>> >>>> >>>>>>> > >> > > and > >>> >>>> >>>>>>> > >> > > > it hasn't started yet. > >>> >>>> >>>>>>> > >> > > > > >>> >>>> >>>>>>> > >> > > > On Mon, Jun 24, 2019 at 2:48 PM > >>> Bowen Li > >>> >>>> >>>>>>> <[hidden email] > >>> <mailto:[hidden email]> <mailto:[hidden email] > >>> <mailto:[hidden email]>> > >>> >>>> <mailto:[hidden email] <mailto:[hidden email]> > >>> <mailto:[hidden email] <mailto:[hidden email]>>>> > >>> >>>> >>>>>>> > >> wrote: > >>> >>>> >>>>>>> > >> > > > > >>> >>>> >>>>>>> > >> > > > > Hi devs, > >>> >>>> >>>>>>> > >> > > > > > >>> >>>> >>>>>>> > >> > > > > I've been experiencing the pain > >>> >>>> resulting from lack > >>> >>>> >>>>>>> of stable > >>> >>>> >>>>>>> > >> build > >>> >>>> >>>>>>> > >> > > > > capacity on Travis for Flink > >>> PRs [1]. 
> >>> >>>> >> Specifically, I > >>> >>>> >>>>>>> noticed > >>> >>>> >>>>>>> > >> often > >>> >>>> >>>>>>> > >> > > that > >>> >>>> >>>>>>> > >> > > > no > >>> >>>> >>>>>>> > >> > > > > build in the queue is making any > >>> >>>> progress for > >>> >>>> >> hours, > >>> >>>> >>>> and > >>> >>>> >>>>>>> > suddenly > >>> >>>> >>>>>>> > >> 5 > >>> >>>> >>>>>>> > >> > or > >>> >>>> >>>>>>> > >> > > 6 > >>> >>>> >>>>>>> > >> > > > > builds kick off all together > >>> after the > >>> >>>> long pause. > >>> >>>> >>>>>>> I'm at PST > >>> >>>> >>>>>>> > >> > (UTC-08) > >>> >>>> >>>>>>> > >> > > > time > >>> >>>> >>>>>>> > >> > > > > zone, and I've seen pause can > >>> be as > >>> >>>> long as 6 hours > >>> >>>> >>>>>>> from PST 9am > >>> >>>> >>>>>>> > >> to > >>> >>>> >>>>>>> > >> > 3pm > >>> >>>> >>>>>>> > >> > > > > (let alone the time needed to > >>> drain the > >>> >>>> queue > >>> >>>> >>>>>>> afterwards). > >>> >>>> >>>>>>> > >> > > > > > >>> >>>> >>>>>>> > >> > > > > I think this has greatly > >>> impacted our > >>> >>>> productivity. > >>> >>>> >>>> I've > >>> >>>> >>>>>>> > >> experienced > >>> >>>> >>>>>>> > >> > > that > >>> >>>> >>>>>>> > >> > > > > PRs submitted in the early > >>> morning of > >>> >>>> PST time zone > >>> >>>> >>>>>>> won't finish > >>> >>>> >>>>>>> > >> > their > >>> >>>> >>>>>>> > >> > > > > build until late night of the > >>> same day. > >>> >>>> >>>>>>> > >> > > > > > >>> >>>> >>>>>>> > >> > > > > So my questions are: > >>> >>>> >>>>>>> > >> > > > > > >>> >>>> >>>>>>> > >> > > > > - Has anyone else experienced > >>> the same > >>> >>>> problem or > >>> >>>> >>>>>>> have similar > >>> >>>> >>>>>>> > >> > > > observation > >>> >>>> >>>>>>> > >> > > > > on TravisCI? (I suspect it > >>> has things > >>> >>>> to do with > >>> >>>> >> time > >>> >>>> >>>>>>> zone) > >>> >>>> >>>>>>> > >> > > > > > >>> >>>> >>>>>>> > >> > > > > - What pricing plan of > >>> TravisCI is > >>> >>>> Flink currently > >>> >>>> >>>>>>> using? 
Is it > >>> >>>> >>>>>>> > >> the > >>> >>>> >>>>>>> > >> > > free > >>> >>>> >>>>>>> > >> > > > > plan for open source > >>> projects? What > >>> >>>> are the > >>> >>>> >>>>>>> guaranteed build > >>> >>>> >>>>>>> > >> capacity > >>> >>>> >>>>>>> > >> > > of > >>> >>>> >>>>>>> > >> > > > > the current plan? > >>> >>>> >>>>>>> > >> > > > > > >>> >>>> >>>>>>> > >> > > > > - If the current pricing plan > >>> (either > >>> >>>> free or paid) > >>> >>>> >>>>>> can't > >>> >>>> >>>>>>> > provide > >>> >>>> >>>>>>> > >> > > stable > >>> >>>> >>>>>>> > >> > > > > build capacity, can we > >>> upgrade to a > >>> >>>> higher priced > >>> >>>> >>>>>>> plan with > >>> >>>> >>>>>>> > larger > >>> >>>> >>>>>>> > >> > and > >>> >>>> >>>>>>> > >> > > > more > >>> >>>> >>>>>>> > >> > > > > stable build capacity? > >>> >>>> >>>>>>> > >> > > > > > >>> >>>> >>>>>>> > >> > > > > BTW, another factor that > >>> contribute to > >>> >>>> the > >>> >>>> >>>>>>> productivity problem > >>> >>>> >>>>>>> > is > >>> >>>> >>>>>>> > >> > that > >>> >>>> >>>>>>> > >> > > > > our build is slow - we run > >>> full build > >>> >>>> for every PR > >>> >>>> >>>> and a > >>> >>>> >>>>>>> > >> successful > >>> >>>> >>>>>>> > >> > > full > >>> >>>> >>>>>>> > >> > > > > build takes ~5h. We > >>> definitely have > >>> >>>> more options to > >>> >>>> >>>>>>> solve it, > >>> >>>> >>>>>>> > for > >>> >>>> >>>>>>> > >> > > > instance, > >>> >>>> >>>>>>> > >> > > > > modularize the build graphs > >>> and reuse > >>> >>>> artifacts > >>> >>>> >> from > >>> >>>> >>>> the > >>> >>>> >>>>>>> > previous > >>> >>>> >>>>>>> > >> > > build. > >>> >>>> >>>>>>> > >> > > > > But I think that can be a big > >>> effort > >>> >>>> which is much > >>> >>>> >>>>>>> harder to > >>> >>>> >>>>>>> > >> > accomplish > >>> >>>> >>>>>>> > >> > > > in > >>> >>>> >>>>>>> > >> > > > > a short period of time and > >>> may deserve > >>> >>>> its own > >>> >>>> >>>> separate > >>> >>>> >>>>>>> > >> discussion. 
> >>> >>>> >>>>>>> > >> > > > > > >>> >>>> >>>>>>> > >> > > > > [1] > >>> >>>> >> https://travis-ci.org/apache/flink/pull_requests > >>> >>>> >>>>>>> > >> > > > > > >>> >>>> >>>>>>> > >> > > > > > >>> >>>> >>>>>>> > >> > > > > >>> >>>> >>>>>>> > >> > > > >>> >>>> >>>>>>> > >> > > >>> >>>> >>>>>>> > >> > >>> >>>> >>>>>>> > > > >>> >>>> >>>>>>> > > >>> >>>> >>>>>>> > >>> >>>> >>>>>>> > >>> >>>> >>>>>>> -- > >>> >>>> >>>>>>> Best Regards > >>> >>>> >>>>>>> > >>> >>>> >>>>>>> Jeff Zhang > >>> >>>> >>>>>>> > >>> >>>> >> > >>> >>>> > >>> >>> > >>> >> > >>> > >> > >> > > > > |
+1 on approval of the migration to our own Travis account. The foreseeable
benefits to the whole community's productivity and iteration speed would be significant! I think whether we use Flinkbot or the Travis REST API is an implementation detail. Once we determine the overall direction, details can be figured out. The good news is that, from my research into how Arrow and Spark integrate their own in-house CI services with their GitHub repos, both use bots with the GitHub API. See a typical PR check for those projects at [1] and [2]. Thus, we are **not alone** on this path. Specifically, Apache Arrow has 'Ursabot', similar to our Flinkbot; I shared the link earlier in the discussion. [3] lays out how Ursabot works and integrates with the GitHub API to trigger builds. I think their documentation is a bit outdated though - the doc says it cannot report build status back to GitHub, but from [1] we can see that build statuses are actually reported. @Chesnay thanks for taking action on this. Though I don't have access to the settings of Flink's GitHub repo, I will continue to help push this initiative in whichever way I can. Wes and Krisztián from Arrow are also very friendly and helpful, and I can connect you with them to learn from their experience. [1] https://github.com/apache/arrow/pull/4809 [2] https://github.com/apache/spark/pull/25053 [3] https://github.com/ursa-labs/ursabot#driving-ursabot On Thu, Jul 4, 2019 at 6:42 AM Hequn Cheng <[hidden email]> wrote: > +1. > > And thanks a lot to Chesnay for pushing this. > > Best, Hequn > > On Thu, Jul 4, 2019 at 8:07 PM Chesnay Schepler <[hidden email]> > wrote: > >> Note that the Flinkbot approach isn't that trivial either; we can't >> _just_ trigger builds for a branch in the apache repo, but would first >> have to clone the branch/pr into a separate repository (that is owned by >> the github account that the travis account would be tied to). >> >> One roadblock after the next showing up...
>>
>> On 04/07/2019 11:59, Chesnay Schepler wrote:
>> > Small update with mostly bad news:
>> >
>> > INFRA doesn't know whether it is possible, and referred me to Travis
>> > support.
>> > They did point out that it could be problematic in regards to
>> > read/write permissions for the repository.
>> >
>> > From my own findings /so far/ with a test repo/organization, it does
>> > not appear possible to configure the Travis account used for a
>> > specific repository.
>> >
>> > So yeah, if we go down this route we may have to extend the Flinkbot
>> > to trigger builds through the Travis REST API.
>> >
>> > On 04/07/2019 10:46, Chesnay Schepler wrote:
>> >> I've raised a JIRA
>> >> <https://issues.apache.org/jira/browse/INFRA-18703> with INFRA to
>> >> inquire whether it would be possible to switch to a different Travis
>> >> account, and if so what steps would need to be taken.
>> >> We need a proper confirmation from INFRA since we are not in full
>> >> control of the flink repository (for example, we cannot access the
>> >> settings page).
>> >>
>> >> If this is indeed possible, Ververica is willing to sponsor a Travis
>> >> account for the Flink project.
>> >> This would provide us with more resources than we need.
>> >>
>> >> Since this makes the project more reliant on resources provided by
>> >> external companies I would like to vote on this.
>> >>
>> >> Please vote on this proposal, as follows:
>> >> [ ] +1, Approve the migration to a Ververica-sponsored Travis
>> >> account, provided that INFRA approves
>> >> [ ] -1, Do not approve the migration to a Ververica-sponsored Travis
>> >> account
>> >>
>> >> The vote will be open for at least 24h, and until we have
>> >> confirmation from INFRA. The voting period may be shorter than the
>> >> usual 3 days since our current setup is effectively not working.
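For readers wondering what "trigger builds through the Travis REST API" could look like in practice, here is a rough sketch, not what Flinkbot actually does. It only assembles the request for the `POST /repo/{slug}/requests` endpoint of Travis API v3 and sends nothing. The function name, the mirror repository slug, the branch name, the commit message and the token placeholder are all made up for illustration.

```python
import json
from urllib.parse import quote

TRAVIS_API = "https://api.travis-ci.com"  # travis-ci.com API host


def build_trigger_request(repo_slug, branch, token, message=None):
    """Return (url, headers, body) for a Travis API v3 build request.

    Nothing is sent here; the caller decides how to POST it.
    """
    # The slug is part of the URL path, so its "/" must be percent-encoded.
    url = "{}/repo/{}/requests".format(TRAVIS_API, quote(repo_slug, safe=""))
    headers = {
        "Content-Type": "application/json",
        "Accept": "application/json",
        "Travis-API-Version": "3",
        "Authorization": "token {}".format(token),
    }
    request = {"branch": branch}
    if message is not None:
        request["message"] = message
    body = json.dumps({"request": request})
    return url, headers, body


url, headers, body = build_trigger_request(
    "flinkbot-ci/flink",  # hypothetical mirror repository
    "ci_pr_1234",         # hypothetical branch holding the cloned PR
    "<travis-token>",     # placeholder, not a real token
    message="CI run for a mirrored pull request",
)
print(url)
print(body)
```

As noted in the thread, the branch would first have to exist in a repository owned by the account the Travis plan is tied to; the sketch assumes that mirroring step has already happened.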
>> >>
>> >> On 04/07/2019 06:51, Bowen Li wrote:
>> >>> Re: > Are they using their own Travis CI pool, or did they switch to
>> >>> an entirely different CI service?
>> >>>
>> >>> I reached out to Wes and Krisztián from the Apache Arrow PMC. They are
>> >>> currently moving away from ASF's Travis to their own in-house
>> >>> bare-metal machines at [1] with a custom CI application at [2].
>> >>> They've seen significant improvements: much higher performance and
>> >>> basically no waiting for resources, a "night-and-day" difference,
>> >>> quoting Wes.
>> >>>
>> >>> Re: > If we can just switch to our own Travis pool, just for our
>> >>> project, then this might be something we can do fairly quickly?
>> >>>
>> >>> I believe so, according to [3] and [4]
>> >>>
>> >>> [1] https://ci.ursalabs.org/
>> >>> [2] https://github.com/ursa-labs/ursabot
>> >>> [3] https://docs.travis-ci.com/user/migrate/open-source-repository-migration
>> >>> [4] https://docs.travis-ci.com/user/migrate/open-source-on-travis-ci-com
>> >>>
>> >>> On Wed, Jul 3, 2019 at 12:01 AM Chesnay Schepler <[hidden email]> wrote:
>> >>>
>> >>> Are they using their own Travis CI pool, or did they switch to an
>> >>> entirely different CI service?
>> >>>
>> >>> If we can just switch to our own Travis pool, just for our project,
>> >>> then this might be something we can do fairly quickly?
>> >>>
>> >>> On 03/07/2019 05:55, Bowen Li wrote:
>> >>> > I responded in the INFRA ticket [1] that I believe they are using
>> >>> > the wrong metric against Flink, and that total build time is a
>> >>> > completely different thing from guaranteed build capacity.
>> >>> >
>> >>> > My response:
>> >>> >
>> >>> > "As mentioned above, since I started paying attention to Flink's
>> >>> > build queue a few weeks ago (I'm in Seattle), I have seen no build
>> >>> > kick off for Flink during PST daytime on weekdays. Our teammates in
>> >>> > China and Europe have also reported similar observations. So we need
>> >>> > to evaluate where the large total build time came from - if 1) your
>> >>> > number and 2) our observations from three locations that cover
>> >>> > pretty much a full day are all true, I **guess** one reason can be
>> >>> > that the extra build time most likely came from weekends, when
>> >>> > other Apache projects may be idle and Flink just drains its
>> >>> > congested queue hard.
>> >>> >
>> >>> > Please be aware that we're not complaining about the lack of
>> >>> > resources in general; I'm complaining about the lack of **stable,
>> >>> > dedicated** resources. An example of the latter: currently, even if
>> >>> > no build is in Flink's queue and I submit a request that becomes
>> >>> > the queue head in the PST morning, my build won't even start for
>> >>> > 6-8+ hours. That is an absurd amount of waiting time.
>> >>> >
>> >>> > That is to say, if ASF INFRA decides to adopt a quota system and
>> >>> > grants Flink five DEDICATED servers that run all the time only for
>> >>> > Flink, that'll be PERFECT and can totally solve our problem now.
>> >>> >
>> >>> > I feel what's missing in ASF INFRA's Travis resource pool is some
>> >>> > level of build capacity SLAs and certainty"
>> >>> >
>> >>> > Again, I believe these two problems differ in nature: long build
>> >>> > time vs. lack of dedicated build resources. That is to say,
>> >>> > shortening build time may relieve the situation, and may not. I'm
>> >>> > slightly negative on disabling IT cases for PRs, because the
>> >>> > downside is that we are at risk of bugs in PRs that UTs don't
>> >>> > catch, which may cost a lot more to fix, especially if they slow
>> >>> > down or even block others, but I am open to others' opinions on it.
>> >>> >
>> >>> > AFAICT from the INFRA ticket [1], donating to ASF INFRA won't be a
>> >>> > feasible way to solve our problem, since INFRA's pool is fully
>> >>> > shared and they have no control over, or finer insight into,
>> >>> > resource allocation to a specific Apache project. As mentioned in
>> >>> > [1], Apache Arrow is moving away from the ASF INFRA Travis pool
>> >>> > (they are actually surprised Flink hasn't planned to do so). I know
>> >>> > that Spark is on its own build infra. If we all agree on funding
>> >>> > our own build infra, I'd be glad to help investigate potential
>> >>> > options after releasing 1.9, since I'm super busy with 1.9 now.
>> >>> >
>> >>> > [1] https://issues.apache.org/jira/browse/INFRA-18533
>> >>> >
>> >>> > On Tue, Jul 2, 2019 at 4:46 AM Chesnay Schepler <[hidden email]> wrote:
>> >>> >
>> >>> >> As a short-term stopgap, since we can assume this issue to become
>> >>> >> much worse in the following days/weeks, we could disable IT cases
>> >>> >> in PRs and only run them on master.
>> >>> >>
>> >>> >> On 02/07/2019 12:03, Chesnay Schepler wrote:
>> >>> >>> People really have to stop thinking that just because something
>> >>> >>> works for us it is also a good solution.
>> >>> >>> Also, please remember that our builds run for 2h from start to
>> >>> >>> finish, and not the 14 _minutes_ it takes for Zeppelin.
>> >>> >>> We are dealing with an entirely different scale here, both in
>> >>> >>> terms of build times and number of builds.
>> >>> >>>
>> >>> >>> In this very thread people have been complaining about long queue
>> >>> >>> times for their builds. Surprise, other Apache projects have been
>> >>> >>> suffering the very same thing due to us not controlling our build
>> >>> >>> times. While switching services (be it Jenkins, CircleCI or
>> >>> >>> whatever) will possibly work for us (and these options are
>> >>> >>> actually attractive, like CircleCI's proper support for build
>> >>> >>> artifacts), it will likely also result in us negatively affecting
>> >>> >>> other projects in significant ways.
>> >>> >>>
>> >>> >>> Sure, the Jenkins setup has a good user experience for us, at the
>> >>> >>> cost of blocking Jenkins workers for a _lot_ of time. Right now we
>> >>> >>> have 25 PRs in our queue; that's possibly 50h we'd consume of
>> >>> >>> Jenkins resources, and the European contributors haven't even
>> >>> >>> really started yet.
>> >>> >>>
>> >>> >>> FYI, the latest INFRA response from INFRA-18533:
>> >>> >>>
>> >>> >>> "Our rough metrics shows that Flink used over 5800 hours of build
>> >>> >>> time last month. That is equal to EIGHT servers running 24/7 for
>> >>> >>> the ENTIRE MONTH. EIGHT. nonstop.
>> >>> >>> When we discovered this last night, we discussed it some and are
>> >>> >>> going to tune down Flink to allow only five executors maximum. We
>> >>> >>> cannot allow Flink to consume so much of a Foundation shared
>> >>> >>> resource."
>> >>> >>>
>> >>> >>> So yes, we either
>> >>> >>> a) have to heavily reduce our CI usage or
>> >>> >>> b) fund our own, either maintaining it ourselves or donating to
>> >>> >>> Apache.
>> >>> >>>
>> >>> >>> On 02/07/2019 05:11, Bowen Li wrote:
>> >>> >>>> Looking at the git history of the Jenkins script, its core part
>> >>> >>>> was finished in March 2017 (with only two minor updates in
>> >>> >>>> 2017/2018), so it's been running for over two years now, and it
>> >>> >>>> feels like the Zeppelin community has been quite happy with it.
>> >>> >>>> @Jeff Zhang <[hidden email]> can you share your insights and user
>> >>> >>>> experience with the Jenkins+Travis approach?
>> >>> >>>>
>> >>> >>>> Things like:
>> >>> >>>>
>> >>> >>>> - has the approach completely solved the resource capacity problem
>> >>> >>>>   for the Zeppelin community? is the Zeppelin community happy with
>> >>> >>>>   the result?
>> >>> >>>> - is the whole configuration chain stable enough (e.g. uptime)?
>> >>> >>>> - how often do you need to maintain the Jenkins infra? how many
>> >>> >>>>   people are usually involved in maintenance and bug-fixes?
>> >>> >>>>
>> >>> >>>> To me, the downside of this approach seems to be mostly
>> >>> >>>> maintenance - maintaining the script and the Jenkins infra.
>> >>> >>>>
>> >>> >>>> ** Having Our Own Travis-CI.com Account **
>> >>> >>>>
>> >>> >>>> Another alternative I've been thinking of is to have our own
>> >>> >>>> travis-ci.com account with paid dedicated resources. Note that
>> >>> >>>> travis-ci.org is the free version and travis-ci.com is the
>> >>> >>>> commercial version. We currently use a shared resource pool
>> >>> >>>> managed by the ASF INFRA team on travis-ci.org, but we have no
>> >>> >>>> control over it - we can't see how it's configured, how many
>> >>> >>>> resources are available, how resources are allocated among Apache
>> >>> >>>> projects, etc.
>> >>> >>>>
>> >>> >>>> The nice things about having an account on travis-ci.com are:
>> >>> >>>>
>> >>> >>>> - relatively low cost with a much better resource guarantee than
>> >>> >>>>   what we currently have [1]: $249/month with 5 dedicated
>> >>> >>>>   concurrent jobs, $489/month with 10 concurrent jobs
>> >>> >>>> - low maintenance work compared to using Jenkins
>> >>> >>>> - (potentially) no migration cost according to Travis's doc [2]
>> >>> >>>>   (pending verification)
>> >>> >>>> - full control over the build capacity/configuration compared to
>> >>> >>>>   using ASF INFRA's pool
>> >>> >>>>
>> >>> >>>> I'd be surprised if we, as such a vibrant community, cannot find
>> >>> >>>> and fund $249*12=$2988 a year in exchange for a much better
>> >>> >>>> developer experience and much higher productivity.
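The figures quoted in this thread can be sanity-checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope check: it assumes a 30-day month, takes the 5800 hours, 25 queued PRs, 2h build time and $249/month plan price from the messages above, and treats one unit of "concurrency" as one job slot that could run around the clock. Variable names are illustrative.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: a 30-day month

# INFRA: "Flink used over 5800 hours of build time last month"
build_hours_last_month = 5800
servers_equivalent = build_hours_last_month / HOURS_PER_MONTH

# "25 PRs in our queue" at roughly 2h per build
queued_prs = 25
hours_per_build = 2
backlog_hours = queued_prs * hours_per_build

# travis-ci.com plan quoted in the thread: $249/month, 5 concurrent jobs
monthly_price_usd = 249
concurrent_jobs = 5
yearly_price_usd = monthly_price_usd * 12
# Upper bound on job hours per month if the 5 slots were never idle
plan_capacity_hours = concurrent_jobs * HOURS_PER_MONTH

print(round(servers_equivalent, 2))  # 8.06
print(backlog_hours)                 # 50
print(yearly_price_usd)              # 2988
print(plan_capacity_hours)           # 3600
```

Under these assumptions, five always-busy job slots top out at 3600 hours a month, which is below the 5800 hours INFRA reported, so the smaller plan alone may not cover the current usage without shorter builds or the larger plan.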
>> >>> >>>>
>> >>> >>>> [1] https://travis-ci.com/plans
>> >>> >>>> [2] https://docs.travis-ci.com/user/migrate/open-source-repository-migration
>> >>> >>>>
>> >>> >>>> On Sat, Jun 29, 2019 at 8:39 AM Chesnay Schepler <[hidden email]> wrote:
>> >>> >>>>
>> >>> >>>> So yes, the Jenkins job keeps pulling the state from Travis until
>> >>> >>>> it finishes.
>> >>> >>>>
>> >>> >>>> Not sure I'm comfortable with the idea of using Jenkins workers
>> >>> >>>> just to idle for several hours.
>> >>> >>>>
>> >>> >>>> On 29/06/2019 14:56, Jeff Zhang wrote:
>> >>> >>>> > Here's what the Zeppelin community did: we wrote a Python script
>> >>> >>>> > to check the build status of a pull request.
>> >>> >>>> > Here's the script:
>> >>> >>>> > https://github.com/apache/zeppelin/blob/master/travis_check.py
>> >>> >>>> >
>> >>> >>>> > And this is the script we use in the Jenkins build job.
>> >>> >>>> >
>> >>> >>>> > if [ -f "travis_check.py" ]; then
>> >>> >>>> >   git log -n 1
>> >>> >>>> >   STATUS=$(curl -s $BUILD_URL | grep -e "GitHub pull request.*from.*" |
>> >>> >>>> >     sed 's/.*GitHub pull request <a href=\"\(https[^"]*\).*from[^"]*.\(https[^"]*\).*/\1 \2/g')
>> >>> >>>> >   AUTHOR=$(echo $STATUS | sed 's/.*[/]\(.*\)$/\1/g')
>> >>> >>>> >   PR=$(echo $STATUS | awk '{print $1}' | sed 's/.*[/]\(.*\)$/\1/g')
>> >>> >>>> >   #COMMIT=$(git log -n 1 | grep "^Merge:" | awk '{print $3}')
>> >>> >>>> >   #if [ -z $COMMIT ]; then
>> >>> >>>> >   #  COMMIT=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR | grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | tr '\n' ' ' | sed 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' | grep -v "apache:" | sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g')
>> >>> >>>> >   #fi
>> >>> >>>> >
>> >>> >>>> >   # get commit hash from PR
>> >>> >>>> >   COMMIT=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR |
>> >>> >>>> >     grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | tr '\n' ' ' |
>> >>> >>>> >     sed 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' |
>> >>> >>>> >     grep -v "apache:" | sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g')
>> >>> >>>> >   sleep 30 # sleep few moment to wait travis starts the build
>> >>> >>>> >   RET_CODE=0
>> >>> >>>> >   python ./travis_check.py ${AUTHOR} ${COMMIT} || RET_CODE=$?
>> >>> >>>> >   if [ $RET_CODE -eq 2 ]; then # try with repository name when travis-ci is not available in the account
>> >>> >>>> >     RET_CODE=0
>> >>> >>>> >     AUTHOR=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR |
>> >>> >>>> >       grep '"full_name":' | grep -v "apache/zeppelin" |
>> >>> >>>> >       sed 's/.*[:][^"]*["]\([^/]*\).*/\1/g')
>> >>> >>>> >     python ./travis_check.py ${AUTHOR} ${COMMIT} || RET_CODE=$?
>> >>> >>>> >   fi
>> >>> >>>> >
>> >>> >>>> >   if [ $RET_CODE -eq 2 ]; then # fail with can't find build information in the travis
>> >>> >>>> >     set +x
>> >>> >>>> >     echo "-----------------------------------------------------"
>> >>> >>>> >     echo "Looks like travis-ci is not configured for your fork."
>> >>> >>>> >     echo "Please setup by swich on 'zeppelin' repository at https://travis-ci.org/profile and travis-ci."
>> >>> >>>> >     echo "And then make sure 'Build branch updates' option is enabled in the settings https://travis-ci.org/${AUTHOR}/zeppelin/settings."
>> >>> >>>> >     echo ""
>> >>> >>>> >     echo "To trigger CI after setup, you will need ammend your last commit with"
>> >>> >>>> >     echo "git commit --amend"
>> >>> >>>> >     echo "git push your-remote HEAD --force"
>> >>> >>>> >     echo ""
>> >>> >>>> >     echo "See http://zeppelin.apache.org/contribution/contributions.html#continuous-integration."
>> >>> >>>> >   fi
>> >>> >>>> >
>> >>> >>>> >   exit $RET_CODE
>> >>> >>>> > else
>> >>> >>>> >   set +x
>> >>> >>>> >   echo "travis_check.py does not exists"
>> >>> >>>> >   exit 1
>> >>> >>>> > fi
>> >>> >>>> >
>> >>> >>>> > On Sat, Jun 29, 2019 at 3:17 PM, Chesnay Schepler <[hidden email]> wrote:
>> >>> >>>> >
>> >>> >>>> >> Does this imply that a Jenkins job is active as long as the
>> >>> >>>> >> Travis build runs?
>> >>> >>>> >>
>> >>> >>>> >> On 26/06/2019 21:28, Bowen Li wrote:
>> >>> >>>> >>> Hi,
>> >>> >>>> >>>
>> >>> >>>> >>> @Dawid, I think the long test running time, as I mentioned in
>> >>> >>>> >>> the first email and as you said, belongs to "a big effort which
>> >>> >>>> >>> is much harder to accomplish in a short period of time and may
>> >>> >>>> >>> deserve its own separate discussion". Thus I didn't include it
>> >>> >>>> >>> in what we can do in the foreseeable short term.
>> >>> >>>> >>>
>> >>> >>>> >>> Besides, I don't think that's the ultimate reason for the lack
>> >>> >>>> >>> of build resources. Even if the build is shortened to something
>> >>> >>>> >>> like 2h, the problem I described, of no build machine working
>> >>> >>>> >>> for 6 or more hours during PST daytime, will still happen,
>> >>> >>>> >>> because no machine from ASF INFRA's pool is allocated to Flink.
>> >>> >>>> >>> I have paid close attention to the build queue over the past
>> >>> >>>> >>> few weekdays, and it's a pretty clear pattern now.
>> >>> >>>> >>>
>> >>> >>>> >>> **The ultimate root cause** for that is - we don't have any
>> >>> >>>> >>> **dedicated** build resources that we can stably rely on.
>> >>> >>>> >>> I'm actually ok with waiting for a long time if there are build
>> >>> >>>> >>> requests running; it means at least we are making progress. But
>> >>> >>>> >>> I'm not ok with having no build resources. A better place I
>> >>> >>>> >>> think we should aim for in the short term is to always have at
>> >>> >>>> >>> least a central pool (can be 3 or 5) of machines dedicated to
>> >>> >>>> >>> building Flink at any time, or maybe use users' resources.
>> >>> >>>> >>>
>> >>> >>>> >>> @Chesnay @Robert I synced with Jeff offline: the Zeppelin
>> >>> >>>> >>> community is using a Jenkins job to automatically build on
>> >>> >>>> >>> users' Travis accounts and link the result back to the GitHub
>> >>> >>>> >>> PR. I guess the Jenkins job would fetch the latest upstream
>> >>> >>>> >>> master and build the PR against it. Jeff has filed tickets to
>> >>> >>>> >>> learn about and get access to the Jenkins infra. It'll be
>> >>> >>>> >>> better to fully understand it first before judging this
>> >>> >>>> >>> approach.
>> >>> >>>> >>>
>> >>> >>>> >>> I also heard good things about CircleCI, and ASF INFRA seems to
>> >>> >>>> >>> have a pool of build capacity there too. It can be an
>> >>> >>>> >>> alternative to consider.
>> >>> >>>> >>>
>> >>> >>>> >>> On Wed, Jun 26, 2019 at 12:44 AM Dawid Wysakowicz <[hidden email]>
>> >>> >>>> >>> wrote:
>> >>> >>>> >>>
>> >>> >>>> >>>> Sorry to jump in late, but I think Bowen missed the most
>> >>> >>>> >>>> important point from Chesnay's previous message in the summary.
>> >>> >>>> >>>> The ultimate reason for all the problems is that the tests
>> >>> >>>> >>>> already take close to 2 hours to run. I fully support this
>> >>> >>>> >>>> claim: "Unless people start caring about test times before
>> >>> >>>> >>>> adding them, this issue cannot be solved"
>> >>> >>>> >>>>
>> >>> >>>> >>>> This is also another reason why using users' Travis accounts
>> >>> >>>> >>>> won't help. Every few weeks we reach the user's time limit for
>> >>> >>>> >>>> a single profile. This makes the user's builds simply fail,
>> >>> >>>> >>>> until we either properly decrease the time the tests take
>> >>> >>>> >>>> (which I am not sure we ever did) or postpone the problem by
>> >>> >>>> >>>> splitting into more profiles. (Note that the ASF Travis account
>> >>> >>>> >>>> has higher time limits.)
>> >>> >>>> >>>>
>> >>> >>>> >>>> Best,
>> >>> >>>> >>>>
>> >>> >>>> >>>> Dawid
>> >>> >>>> >>>>
>> >>> >>>> >>>> On 26/06/2019 09:36, Robert Metzger wrote:
>> >>> >>>> >>>>> Do we know if using "the best" available hardware would
>> >>> >>>> >>>>> improve the build times?
>> >>> >>>> >>>>> Imagine we ran the build on machines with plenty of main
>> >>> >>>> >>>>> memory to mount everything on a ramdisk + the latest CPU
>> >>> >>>> >>>>> architecture?
>> >>> >>>> >>>>>
>> >>> >>>> >>>>> Throwing hardware at the problem could help reduce the time of
>> >>> >>>> >>>>> an individual build, and using our own infrastructure would
>> >>> >>>> >>>>> remove our dependency on Apache's Travis account (with the
>> >>> >>>> >>>>> obvious downside of having to maintain the infrastructure).
>> >>> >>>> >>>>> We could use an open source Travis alternative, to have a
>> >>> >>>> >>>>> similar experience and make the migration easy.
>> >>> >>>> >>>>>
>> >>> >>>> >>>>> On Wed, Jun 26, 2019 at 9:34 AM Chesnay Schepler <[hidden email]>
>> >>> >>>> >>>>> wrote:
>> >>> >>>> >>>>>> From what I gathered, there's no special sauce that the
>> >>> >>>> >>>>>> Zeppelin project uses which actually integrates a user's
>> >>> >>>> >>>>>> Travis account into the PR.
>> >>> >>>> >>>>>> They just disabled Travis for PRs. And that's kind of it.
>> >>> >>>> >>>>>>
>> >>> >>>> >>>>>> Naturally we can do this (duh) and save the ASF a fair amount
>> >>> >>>> >>>>>> of resources, but there are downsides:
>> >>> >>>> >>>>>>
>> >>> >>>> >>>>>> The discoverability of the Travis check takes a nose-dive.
>> >>> >>>> >>>>>> Either we require every contributor to always, on every
>> >>> >>>> >>>>>> commit, also post a Travis build, or we have the reviewer
>> >>> >>>> >>>>>> sift through the contributor's account to find it.
>> >>> >>>> >>>>>>
>> >>> >>>> >>>>>> This is rather cumbersome. Additionally, it's also not
>> >>> >>>> >>>>>> equivalent to having a PR build.
>> >>> >>>> >>>>>>
>> >>> >>>> >>>>>> A normal branch build takes a branch as is and tests it.
>> >>> >>>> >>>>>> A PR build merges the branch into master, and then runs it.
>> >>> >>>> >>>>>> (Fun fact: This is why a PR with merge conflicts is not being
>> >>> >>>> >>>>>> run on Travis.)
>> >>> >>>> >>>>>>
>> >>> >>>> >>>>>> And ultimately, everyone can already make use of this approach
>> >>> >>>> >>>>>> anyway.
>> >>> >>>> >>>>>>
>> >>> >>>> >>>>>> On 25/06/2019 08:02, Jark Wu wrote:
>> >>> >>>> >>>>>>> Hi Jeff,
>> >>> >>>> >>>>>>>
>> >>> >>>> >>>>>>> Thanks for sharing the Zeppelin approach. I think it's a good
>> >>> >>>> >>>>>>> idea to leverage users' Travis accounts.
>> >>> >>>> >>>>>>> In this way, we can have almost unlimited concurrent build
>> >>> >>>> >>>>>>> jobs, and developers can restart builds by themselves
>> >>> >>>> >>>>>>> (currently only committers can restart a PR's build).
>> >>> >>>> >>>>>>>
>> >>> >>>> >>>>>>> But I'm still not very clear on how to integrate a user's
>> >>> >>>> >>>>>>> Travis build into the Flink pull request's build
>> >>> >>>> >>>>>>> automatically. Can you explain in more detail?
>> >>> >>>> >>>>>>>
>> >>> >>>> >>>>>>> Another question: does Travis only build branches for user
>> >>> >>>> >>>>>>> accounts?
>> >>> >>>> >>>>>>> My concern is that builds for PRs rebase the user's commits
>> >>> >>>> >>>>>>> against the current master branch.
>> >>> >>>> >>>>>>> This helps us find problems before merging. Builds for
>> >>> >>>> >>>>>>> branches will lose the impact of new commits in master.
>> >>> >>>> >>>>>>> How does Zeppelin solve this problem?
>> >>> >>>> >>>>>>>
>> >>> >>>> >>>>>>> Thanks again for sharing the idea.
>> >>> >>>> >>>>>>>
>> >>> >>>> >>>>>>> Regards,
>> >>> >>>> >>>>>>> Jark
>> >>> >>>> >>>>>>>
>> >>> >>>> >>>>>>> On Tue, 25 Jun 2019 at 11:01, Jeff Zhang <[hidden email]> wrote:
>> >>> >>>> >>>>>>>
>> >>> >>>> >>>>>>> Hi Folks,
>> >>> >>>> >>>>>>>
>> >>> >>>> >>>>>>> Zeppelin met this kind of issue before; we solved it by
>> >>> >>>> >>>>>>> delegating each contributor's PR build to their own Travis
>> >>> >>>> >>>>>>> account (everyone can have 5 free slots for Travis builds).
>> >>> >>>> >>>>>>> The Apache account's Travis build is only triggered when a PR
>> >>> >>>> >>>>>>> is merged.
>> >>> >>>> >>>>>>>
>> >>> >>>> >>>>>>> On Tue, Jun 25, 2019 at 10:16 AM, Kurt Young <[hidden email]>
>> >>> >>>> >>>>>>> wrote:
>> >>> >>>> >>>>>>>
>> >>> >>>> >>>>>>> > (Forgot to cc George)
>> >>> >>>> >>>>>>> >
>> >>> >>>> >>>>>>> > Best,
>> >>> >>>> >>>>>>> > Kurt
>> >>> >>>> >>>>>>> >
>> >>> >>>> >>>>>>> > On Tue, Jun 25, 2019 at 10:16 AM Kurt Young <[hidden email]>
>> >>> >>>> >>>>>>> > wrote:
>> >>> >>>> >>>>>>> >
>> >>> >>>> >>>>>>> > > Hi Bowen,
>> >>> >>>> >>>>>>> > >
>> >>> >>>> >>>>>>> > > Thanks for bringing this up.
>> >>> >>>> >>>>>>> > > We have actually discussed this, and I think Till and George
>> >>> >>>> >>>>>>> > > have already spent some time investigating it. I have cc'ed
>> >>> >>>> >>>>>>> > > both of them, and maybe they can share their findings.
>> >>> >>>> >>>>>>> > >
>> >>> >>>> >>>>>>> > > Best,
>> >>> >>>> >>>>>>> > > Kurt
>> >>> >>>> >>>>>>> > >
>> >>> >>>> >>>>>>> > > On Tue, Jun 25, 2019 at 10:08 AM Jark Wu <[hidden email]>
>> >>> >>>> >>>>>>> > > wrote:
>> >>> >>>> >>>>>>> > >
>> >>> >>>> >>>>>>> > >> Hi Bowen,
>> >>> >>>> >>>>>>> > >>
>> >>> >>>> >>>>>>> > >> Thanks for bringing this up. We have also suffered from the
>> >>> >>>> >>>>>>> > >> long build time.
>> >>> >>>> >>>>>>> > >> I agree that we should focus on solving the build capacity
>> >>> >>>> >>>>>>> > >> problem in this thread.
>> >>> >>>> >>>>>>> > >>
>> >>> >>>> >>>>>>> > >> My observation is that only one build is running while all
>> >>> >>>> >>>>>>> > >> the others (other PRs, master) are pending.
>> >>> >>>> >>>>>>> > >> The Travis pricing plan [1] shows it can support concurrent
>> >>> >>>> >>>>>>> > >> build jobs.
>> >>> >>>> >>>>>>> > >> But I don't know which plan we are using; it might be the
>> >>> >>>> >>>>>>> > >> free plan for open source.
>> >>> >>>> >>>>>>> > >>
>> >>> >>>> >>>>>>> > >> I cc'ed Chesnay, who may have some experience with Travis.
>> >>> >>>> >>>>>>> > >>
>> >>> >>>> >>>>>>> > >> Regards,
>> >>> >>>> >>>>>>> > >> Jark
>> >>> >>>> >>>>>>> > >>
>> >>> >>>> >>>>>>> > >> [1]: https://travis-ci.com/plans
>> >>> >>>> >>>>>>> > >>
>> >>> >>>> >>>>>>> > >> On Tue, 25 Jun 2019 at 08:11, Bowen Li <[hidden email]> wrote:
>> >>> >>>> >>>>>>> > >>
>> >>> >>>> >>>>>>> > >> > Hi Steven,
>> >>> >>>> >>>>>>> > >> >
>> >>> >>>> >>>>>>> > >> > I think you may not have read what I wrote. The discussion
>> >>> >>>> >>>>>>> > >> > is about "unstable build **capacity**", in other words
>> >>> >>>> >>>>>>> > >> > "unstable / lack of build resources", not "unstable build".
>> >>> >>>> >>>>>>> > >> >
>> >>> >>>> >>>>>>> > >> > On Mon, Jun 24, 2019 at 4:40 PM Steven Wu <[hidden email]>
>> >>> >>>> >>>>>>> > >> > wrote:
>> >>> >>>> >>>>>>> > >> >
>> >>> >>>> >>>>>>> > >> > > A long and sometimes unstable build is definitely a pain
>> >>> >>>> >>>>>>> > >> > > point.
>> >>> >>>> >>>>>>> > >> > >
>> >>> >>>> >>>>>>> > >> > > I suspect the build failure here in flink-connector-kafka
>> >>> >>>> >>>>>>> > >> > > is not related to my change, but there is no easy way to
>> >>> >>>> >>>>>>> > >> > > re-run the build on the Travis UI. A search showed that
>> >>> >>>> >>>>>>> > >> > > closing and reopening the PR will trigger a rebuild.
>> >>> >>>> >>>>>>> > >> > > But that could add noise to the PR activity.
>> >>> >>>> >>>>>>> > >> > > https://travis-ci.org/apache/flink/jobs/545555519
>> >>> >>>> >>>>>>> > >> > >
>> >>> >>>> >>>>>>> > >> > > travis-ci for my personal repo often failed by exceeding
>> >>> >>>> >>>>>>> > >> > > the time limit after 4+ hours:
>> >>> >>>> >>>>>>> > >> > > "The job exceeded the maximum time limit for jobs, and has
>> >>> >>>> >>>>>>> > >> > > been terminated."
>> >>> >>>> >>>>>>> > >> > >
>> >>> >>>> >>>>>>> > >> > > On Mon, Jun 24, 2019 at 4:15 PM Bowen Li <[hidden email]>
>> >>> >>>> >>>>>>> > >> > > wrote:
>> >>> >>>> >>>>>>> > >> > >
>> >>> >>>> >>>>>>> > >> > > > https://travis-ci.org/apache/flink/builds/549681530
>> >>> >>>> >>>>>>> > >> > > > This build request has been sitting at the **HEAD of the
>> >>> >>>> >>>>>>> > >> > > > queue** since I first saw it at 10:30am PST (not sure how
>> >>> >>>> >>>>>>> > >> > > > long it had been there before 10:30am). It's 4:12pm PST
>> >>> >>>> >>>>>>> > >> > > > now and it hasn't started yet.
>> >>> >>>> >>>>>>> > >> > > > >> >>> >>>> >>>>>>> > >> > > > On Mon, Jun 24, 2019 at 2:48 PM >> >>> Bowen Li >> >>> >>>> >>>>>>> <[hidden email] >> >>> <mailto:[hidden email]> <mailto:[hidden email] >> >>> <mailto:[hidden email]>> >> >>> >>>> <mailto:[hidden email] <mailto:[hidden email] >> > >> >>> <mailto:[hidden email] <mailto:[hidden email]>>>> >> >>> >>>> >>>>>>> > >> wrote: >> >>> >>>> >>>>>>> > >> > > > >> >>> >>>> >>>>>>> > >> > > > > Hi devs, >> >>> >>>> >>>>>>> > >> > > > > >> >>> >>>> >>>>>>> > >> > > > > I've been experiencing the pain >> >>> >>>> resulting from lack >> >>> >>>> >>>>>>> of stable >> >>> >>>> >>>>>>> > >> build >> >>> >>>> >>>>>>> > >> > > > > capacity on Travis for Flink >> >>> PRs [1]. >> >>> >>>> >> Specifically, I >> >>> >>>> >>>>>>> noticed >> >>> >>>> >>>>>>> > >> often >> >>> >>>> >>>>>>> > >> > > that >> >>> >>>> >>>>>>> > >> > > > no >> >>> >>>> >>>>>>> > >> > > > > build in the queue is making any >> >>> >>>> progress for >> >>> >>>> >> hours, >> >>> >>>> >>>> and >> >>> >>>> >>>>>>> > suddenly >> >>> >>>> >>>>>>> > >> 5 >> >>> >>>> >>>>>>> > >> > or >> >>> >>>> >>>>>>> > >> > > 6 >> >>> >>>> >>>>>>> > >> > > > > builds kick off all together >> >>> after the >> >>> >>>> long pause. >> >>> >>>> >>>>>>> I'm at PST >> >>> >>>> >>>>>>> > >> > (UTC-08) >> >>> >>>> >>>>>>> > >> > > > time >> >>> >>>> >>>>>>> > >> > > > > zone, and I've seen pause can >> >>> be as >> >>> >>>> long as 6 hours >> >>> >>>> >>>>>>> from PST 9am >> >>> >>>> >>>>>>> > >> to >> >>> >>>> >>>>>>> > >> > 3pm >> >>> >>>> >>>>>>> > >> > > > > (let alone the time needed to >> >>> drain the >> >>> >>>> queue >> >>> >>>> >>>>>>> afterwards). >> >>> >>>> >>>>>>> > >> > > > > >> >>> >>>> >>>>>>> > >> > > > > I think this has greatly >> >>> impacted our >> >>> >>>> productivity. 
>> >>> >>>> >>>> I've >> >>> >>>> >>>>>>> > >> experienced >> >>> >>>> >>>>>>> > >> > > that >> >>> >>>> >>>>>>> > >> > > > > PRs submitted in the early >> >>> morning of >> >>> >>>> PST time zone >> >>> >>>> >>>>>>> won't finish >> >>> >>>> >>>>>>> > >> > their >> >>> >>>> >>>>>>> > >> > > > > build until late night of the >> >>> same day. >> >>> >>>> >>>>>>> > >> > > > > >> >>> >>>> >>>>>>> > >> > > > > So my questions are: >> >>> >>>> >>>>>>> > >> > > > > >> >>> >>>> >>>>>>> > >> > > > > - Has anyone else experienced >> >>> the same >> >>> >>>> problem or >> >>> >>>> >>>>>>> have similar >> >>> >>>> >>>>>>> > >> > > > observation >> >>> >>>> >>>>>>> > >> > > > > on TravisCI? (I suspect it >> >>> has things >> >>> >>>> to do with >> >>> >>>> >> time >> >>> >>>> >>>>>>> zone) >> >>> >>>> >>>>>>> > >> > > > > >> >>> >>>> >>>>>>> > >> > > > > - What pricing plan of >> >>> TravisCI is >> >>> >>>> Flink currently >> >>> >>>> >>>>>>> using? Is it >> >>> >>>> >>>>>>> > >> the >> >>> >>>> >>>>>>> > >> > > free >> >>> >>>> >>>>>>> > >> > > > > plan for open source >> >>> projects? What >> >>> >>>> are the >> >>> >>>> >>>>>>> guaranteed build >> >>> >>>> >>>>>>> > >> capacity >> >>> >>>> >>>>>>> > >> > > of >> >>> >>>> >>>>>>> > >> > > > > the current plan? >> >>> >>>> >>>>>>> > >> > > > > >> >>> >>>> >>>>>>> > >> > > > > - If the current pricing plan >> >>> (either >> >>> >>>> free or paid) >> >>> >>>> >>>>>> can't >> >>> >>>> >>>>>>> > provide >> >>> >>>> >>>>>>> > >> > > stable >> >>> >>>> >>>>>>> > >> > > > > build capacity, can we >> >>> upgrade to a >> >>> >>>> higher priced >> >>> >>>> >>>>>>> plan with >> >>> >>>> >>>>>>> > larger >> >>> >>>> >>>>>>> > >> > and >> >>> >>>> >>>>>>> > >> > > > more >> >>> >>>> >>>>>>> > >> > > > > stable build capacity? 
>> >>> >>>> >>>>>>> > >> > > > > >> >>> >>>> >>>>>>> > >> > > > > BTW, another factor that >> >>> contribute to >> >>> >>>> the >> >>> >>>> >>>>>>> productivity problem >> >>> >>>> >>>>>>> > is >> >>> >>>> >>>>>>> > >> > that >> >>> >>>> >>>>>>> > >> > > > > our build is slow - we run >> >>> full build >> >>> >>>> for every PR >> >>> >>>> >>>> and a >> >>> >>>> >>>>>>> > >> successful >> >>> >>>> >>>>>>> > >> > > full >> >>> >>>> >>>>>>> > >> > > > > build takes ~5h. We >> >>> definitely have >> >>> >>>> more options to >> >>> >>>> >>>>>>> solve it, >> >>> >>>> >>>>>>> > for >> >>> >>>> >>>>>>> > >> > > > instance, >> >>> >>>> >>>>>>> > >> > > > > modularize the build graphs >> >>> and reuse >> >>> >>>> artifacts >> >>> >>>> >> from >> >>> >>>> >>>> the >> >>> >>>> >>>>>>> > previous >> >>> >>>> >>>>>>> > >> > > build. >> >>> >>>> >>>>>>> > >> > > > > But I think that can be a big >> >>> effort >> >>> >>>> which is much >> >>> >>>> >>>>>>> harder to >> >>> >>>> >>>>>>> > >> > accomplish >> >>> >>>> >>>>>>> > >> > > > in >> >>> >>>> >>>>>>> > >> > > > > a short period of time and >> >>> may deserve >> >>> >>>> its own >> >>> >>>> >>>> separate >> >>> >>>> >>>>>>> > >> discussion. >> >>> >>>> >>>>>>> > >> > > > > >> >>> >>>> >>>>>>> > >> > > > > [1] >> >>> >>>> >> https://travis-ci.org/apache/flink/pull_requests >> >>> >>>> >>>>>>> > >> > > > > >> >>> >>>> >>>>>>> > >> > > > > >> >>> >>>> >>>>>>> > >> > > > >> >>> >>>> >>>>>>> > >> > > >> >>> >>>> >>>>>>> > >> > >> >>> >>>> >>>>>>> > >> >> >>> >>>> >>>>>>> > > >> >>> >>>> >>>>>>> > >> >>> >>>> >>>>>>> >> >>> >>>> >>>>>>> >> >>> >>>> >>>>>>> -- >> >>> >>>> >>>>>>> Best Regards >> >>> >>>> >>>>>>> >> >>> >>>> >>>>>>> Jeff Zhang >> >>> >>>> >>>>>>> >> >>> >>>> >> >> >>> >>>> >> >>> >>> >> >>> >> >> >>> >> >> >> >> >> > >> >> |
In reply to this post by Hequn Cheng
+1 for the migration.
Best,
Congxian

Hequn Cheng <[hidden email]> wrote on Thu, Jul 4, 2019 at 9:42 PM:

> +1.
>
> And thanks a lot to Chesnay for pushing this.
>
> Best, Hequn
>
> On Thu, Jul 4, 2019 at 8:07 PM Chesnay Schepler <[hidden email]> wrote:
>
> > Note that the Flinkbot approach isn't that trivial either; we can't
> > _just_ trigger builds for a branch in the apache repo, but would first
> > have to clone the branch/PR into a separate repository (one owned by
> > the GitHub account that the Travis account would be tied to).
> >
> > One roadblock after the next showing up...
> >
> > On 04/07/2019 11:59, Chesnay Schepler wrote:
> > > Small update with mostly bad news:
> > >
> > > INFRA doesn't know whether it is possible, and referred me to Travis
> > > support.
> > > They did point out that it could be problematic with regard to
> > > read/write permissions for the repository.
> > >
> > > From my own findings /so far/ with a test repo/organization, it does
> > > not appear possible to configure the Travis account used for a
> > > specific repository.
> > >
> > > So yeah, if we go down this route we may have to pimp the Flinkbot to
> > > trigger builds through the Travis REST API.
> > >
> > > On 04/07/2019 10:46, Chesnay Schepler wrote:
> > >> I've raised a JIRA
> > >> <https://issues.apache.org/jira/browse/INFRA-18703> with INFRA to
> > >> inquire whether it would be possible to switch to a different Travis
> > >> account, and if so what steps would need to be taken.
> > >> We need a proper confirmation from INFRA since we are not in full
> > >> control of the flink repository (for example, we cannot access the
> > >> settings page).
> > >>
> > >> If this is indeed possible, Ververica is willing to sponsor a Travis
> > >> account for the Flink project.
> > >> This would provide us with more than enough resources.
> > >>
> > >> Since this makes the project more reliant on resources provided by
> > >> external companies I would like to vote on this.
> > >>
> > >> Please vote on this proposal, as follows:
> > >> [ ] +1, Approve the migration to a Ververica-sponsored Travis
> > >> account, provided that INFRA approves
> > >> [ ] -1, Do not approve the migration to a Ververica-sponsored Travis
> > >> account
> > >>
> > >> The vote will be open for at least 24h, and until we have
> > >> confirmation from INFRA. The voting period may be shorter than the
> > >> usual 3 days since our current setup is effectively not working.
> > >>
> > >> On 04/07/2019 06:51, Bowen Li wrote:
> > >>> Re: > Are they using their own Travis CI pool, or did they switch to
> > >>> an entirely different CI service?
> > >>>
> > >>> I reached out to Wes and Krisztián from the Apache Arrow PMC. They are
> > >>> currently moving away from ASF's Travis to their own in-house metal
> > >>> machines at [1] with a custom CI application at [2]. They've seen
> > >>> significant improvement w.r.t. both much higher performance and
> > >>> basically no resource waiting time, a "night-and-day" difference
> > >>> quoting Wes.
> > >>>
> > >>> Re: > If we can just switch to our own Travis pool, just for our
> > >>> project, then this might be something we can do fairly quickly?
> > >>>
> > >>> I believe so, according to [3] and [4]
> > >>>
> > >>> [1] https://ci.ursalabs.org/
> > >>> [2] https://github.com/ursa-labs/ursabot
> > >>> [3] https://docs.travis-ci.com/user/migrate/open-source-repository-migration
> > >>> [4] https://docs.travis-ci.com/user/migrate/open-source-on-travis-ci-com
> > >>>
> > >>> On Wed, Jul 3, 2019 at 12:01 AM Chesnay Schepler <[hidden email]> wrote:
> > >>>
> > >>>     Are they using their own Travis CI pool, or did they switch to an
> > >>>     entirely different CI service?
> > >>>
> > >>>     If we can just switch to our own Travis pool, just for our
> > >>>     project, then this might be something we can do fairly quickly?
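A minimal sketch of what "trigger builds through the Travis REST API" could look like, assuming the Travis API v3 build-request endpoint on travis-ci.com. The repository slug, branch name, and token are hypothetical placeholders, and the request is only constructed here, not sent; none of this is taken from the actual Flinkbot code.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the sponsored account lives on travis-ci.com.
TRAVIS_API = "https://api.travis-ci.com"


def build_trigger_request(repo_slug, branch, token, message="Triggered via API"):
    """Return an unsent POST request asking Travis to build `branch` of `repo_slug`."""
    # The slug must be URL-encoded: "owner/repo" becomes "owner%2Frepo".
    slug = urllib.parse.quote(repo_slug, safe="")
    body = {"request": {"branch": branch, "message": message}}
    return urllib.request.Request(
        url=f"{TRAVIS_API}/repo/{slug}/requests",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json",
            "Travis-API-Version": "3",
            "Authorization": f"token {token}",
        },
    )


# Hypothetical mirror repository and branch, for illustration only.
req = build_trigger_request("example-bot/flink-mirror", "pr-1234", "SECRET_TOKEN")
print(req.get_method(), req.full_url)
```

Sending the request would be a call to `urllib.request.urlopen(req)`. The branch would still have to be pushed to the bot-owned repository first, which is the extra step Chesnay describes.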
> > >>>
> > >>> On 03/07/2019 05:55, Bowen Li wrote:
> > >>> > I responded in the INFRA ticket [1] that I believe they are using a wrong
> > >>> > metric against Flink and the total build time is a completely different
> > >>> > thing than guaranteed build capacity.
> > >>> >
> > >>> > My response:
> > >>> >
> > >>> > "As mentioned above, since I started to pay attention to Flink's build
> > >>> > queue a few tens of days ago, I'm in Seattle and I saw no build was kicking
> > >>> > off in PST daytime in weekdays for Flink. Our teammates in China and Europe
> > >>> > have also reported similar observations. So we need to evaluate how the
> > >>> > large total build time came from - if 1) your number and 2) our
> > >>> > observations from three locations that cover pretty much a full day, are
> > >>> > all true, I **guess** one reason can be that - highly likely the extra
> > >>> > build time came from weekends when other Apache projects may be idle and
> > >>> > Flink just drains hard its congested queue.
> > >>> >
> > >>> > Please be aware of that we're not complaining about the lack of resources
> > >>> > in general, I'm complaining about the lack of **stable, dedicated**
> > >>> > resources. An example for the latter one is, currently even if no build is
> > >>> > in Flink's queue and I submit a request to be the queue head in PST
> > >>> > morning, my build won't even start in 6-8+h. That is an absurd amount of
> > >>> > waiting time.
> > >>> >
> > >>> > That's saying, if ASF INFRA decides to adopt a quota system and grants
> > >>> > Flink five DEDICATED servers that runs all the time only for Flink, that'll
> > >>> > be PERFECT and can totally solve our problem now.
> > >>> >
> > >>> > I feel what's missing in ASF INFRA's Travis resource pool is some level
> > >>> > of build capacity SLAs and certainty"
> > >>> >
> > >>> > Again, I believe these two problems differ in nature: long build time
> > >>> > vs. lack of dedicated build resources. That is to say, shortening build
> > >>> > time may relieve the situation, or it may not. I'm slightly negative on
> > >>> > disabling IT cases for PRs: the downside is that we are at risk of bugs
> > >>> > in PRs that UTs don't catch, which may cost a lot more to fix if they
> > >>> > slow down or even block others. But I am open to others' opinions on it.
> > >>> >
> > >>> > AFAICT from the INFRA ticket [1], donating to ASF INFRA won't be feasible
> > >>> > to solve our problem, since INFRA's pool is fully shared and they have no
> > >>> > control over, or finer insight into, resource allocation to a specific
> > >>> > Apache project. As mentioned in [1], Apache Arrow is moving away from the
> > >>> > ASF INFRA Travis pool (they are actually surprised Flink hasn't planned
> > >>> > to do so).
I > > >>> > know that Spark is on its own build infra. If we all agree that > > >>> funding our > > >>> > own build infra, I'd be glad to help investigate any potential > > >>> options > > >>> > after releasing 1.9 since I'm super busy with 1.9 now. > > >>> > > > >>> > [1] https://issues.apache.org/jira/browse/INFRA-18533 > > >>> > > > >>> > > > >>> > > > >>> > On Tue, Jul 2, 2019 at 4:46 AM Chesnay Schepler > > >>> <[hidden email] <mailto:[hidden email]>> wrote: > > >>> > > > >>> >> As a short-term stopgap, since we can assume this issue to > > >>> become much > > >>> >> worse in the following days/weeks, we could disable IT cases > in > > >>> PRs and > > >>> >> only run them on master. > > >>> >> > > >>> >> On 02/07/2019 12:03, Chesnay Schepler wrote: > > >>> >>> People really have to stop thinking that just because > > >>> something works > > >>> >>> for us it is also a good solution. > > >>> >>> Also, please remember that our builds run for 2h from start > to > > >>> finish, > > >>> >>> and not the 14 _minutes_ it takes for zeppelin. > > >>> >>> We are dealing with an entirely different scale here, both in > > >>> terms of > > >>> >>> build times and number of builds. > > >>> >>> > > >>> >>> In this very thread people have been complaining about long > > >>> queue > > >>> >>> times for their builds. Surprise, other Apache projects have > > >>> been > > >>> >>> suffering the very same thing due to us not controlling our > > >>> build > > >>> >>> times. While switching services (be it Jenkins, CircleCI or > > >>> whatever) > > >>> >>> will possibly work for us (and these options are actually > > >>> attractive, > > >>> >>> like CircleCI's proper support for build artifacts), it will > > >>> also > > >>> >>> result in us likely negatively affecting other projects in > > >>> significant > > >>> >>> ways. > > >>> >>> > > >>> >>> Sure, the Jenkins setup has a good user experience for us, at > > >>> the cost > > >>> >>> of blocking Jenkins workers for a _lot_ of time. 
Right now we > > >>> have 25 > > >>> >>> PR's in our queue; that's possibly 50h we'd consume of > Jenkins > > >>> >>> resources, and the European contributors haven't even really > > >>> started yet. > > >>> >>> > > >>> >>> FYI, the latest INFRA response from INFRA-18533: > > >>> >>> > > >>> >>> "Our rough metrics shows that Flink used over 5800 hours of > > >>> build time > > >>> >>> last month. That is equal to EIGHT servers running 24/7 for > > >>> the ENTIRE > > >>> >>> MONTH. EIGHT. nonstop. > > >>> >>> When we discovered this last night, we discussed it some and > > >>> are going > > >>> >>> to tune down Flink to allow only five executors maximum. We > > >>> cannot > > >>> >>> allow Flink to consume so much of a Foundation shared > > >>> resource." > > >>> >>> > > >>> >>> So yes, we either > > >>> >>> a) have to heavily reduce our CI usage or > > >>> >>> b) fund our own, either maintaining it ourselves or donating > > >>> to Apache. > > >>> >>> > > >>> >>> On 02/07/2019 05:11, Bowen Li wrote: > > >>> >>>> By looking at the git history of the Jenkins script, its > core > > >>> part > > >>> >>>> was finished in March 2017 (and only two minor update in > > >>> 2017/2018), > > >>> >>>> so it's been running for over two years now and feels like > > >>> Zepplin > > >>> >>>> community has been quite happy with it. @Jeff Zhang > > >>> >>>> <mailto:[hidden email] <mailto:[hidden email]>> can you > > >>> share your insights and user > > >>> >>>> experience with the Jenkins+Travis approach? > > >>> >>>> > > >>> >>>> Things like: > > >>> >>>> > > >>> >>>> - has the approach completely solved the resource capacity > > >>> problem > > >>> >>>> for Zepplin community? is Zepplin community happy with the > > >>> result? > > >>> >>>> - is the whole configuration chain stable (e.g. uptime) > > >>> enough? > > >>> >>>> - how often do you need to maintain the Jenkins infra? how > > >>> many > > >>> >>>> people are usually involved in maintenance and bug-fixes? 
> > >>> >>>> > > >>> >>>> The downside of this approach seems mostly to be on the > > >>> maintenance > > >>> >>>> to me - maintain the script and Jenkins infra. > > >>> >>>> > > >>> >>>> ** Having Our Own Travis-CI.com Account ** > > >>> >>>> > > >>> >>>> Another alternative I've been thinking of is to have our own > > >>> >>>> travis-ci.com <http://travis-ci.com> <http://travis-ci.com> > > >>> account with paid dedicated > > >>> >>>> resources. Note travis-ci.org <http://travis-ci.org> > > >>> <http://travis-ci.org> is the free > > >>> >>>> version and travis-ci.com <http://travis-ci.com> > > >>> <http://travis-ci.com> is the commercial > > >>> >>>> version. We currently use a shared resource pool managed by > > >>> ASK INFRA > > >>> >>>> team on travis-ci.org <http://travis-ci.org> > > >>> <http://travis-ci.org>, but we have no control > > >>> >>>> over it - we can't see how it's configured, how much > > >>> resources are > > >>> >>>> available, how resources are allocated among Apache > projects, > > >>> etc. > > >>> >>>> The nice thing about having an account on travis-ci.com > > >>> <http://travis-ci.com> > > >>> >>>> <http://travis-ci.com> are: > > >>> >>>> > > >>> >>>> - relatively low cost with much better resource guarantee > > >>> than what > > >>> >>>> we currently have [1]: $249/month with 5 dedicated > > >>> concurrency, > > >>> >>>> $489/month with 10 concurrency > > >>> >>>> - low maintenance work compared to using Jenkins > > >>> >>>> - (potentially) no migration cost according to Travis's doc > > >>> [2] > > >>> >>>> (pending verification) > > >>> >>>> - full control over the build capacity/configuration > > >>> compared to > > >>> >>>> using ASF INFRA's pool > > >>> >>>> > > >>> >>>> I'd be surprised if we as such a vibrant community cannot > > >>> find and > > >>> >>>> fund $249*12=$2988 a year in exchange for a much better > > >>> developer > > >>> >>>> experience and much higher productivity. 
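The figures quoted in this part of the thread can be sanity-checked with a few lines of arithmetic. The sketch below assumes a 30-day month, and the function names are illustrative only.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: a 30-day month


def servers_equivalent(build_hours_per_month):
    """Number of machines running 24/7 needed to supply the given build hours."""
    return build_hours_per_month / HOURS_PER_MONTH


def backlog_hours(pending_builds, hours_per_build):
    """Total executor time needed to work through a queue of pending builds."""
    return pending_builds * hours_per_build


def yearly_cost(monthly_price_usd):
    """Annual cost of a plan billed monthly."""
    return monthly_price_usd * 12


print(round(servers_equivalent(5800), 2))  # INFRA's 5800 hours: about eight servers
print(backlog_hours(25, 2))                # 25 queued PRs at roughly 2h each
print(yearly_cost(249))                    # the $249/month, 5-concurrency plan
```

Under these assumptions the numbers agree with what was stated: roughly eight servers' worth of build time, 50 hours of backlog, and $2988 per year.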
> > >>> >>>> > > >>> >>>> [1] https://travis-ci.com/plans > > >>> >>>> [2] > > >>> >>>> > > >>> >> > > >>> > > https://docs.travis-ci.com/user/migrate/open-source-repository-migration > > >>> > > >>> >>>> On Sat, Jun 29, 2019 at 8:39 AM Chesnay Schepler > > >>> <[hidden email] <mailto:[hidden email]> > > >>> >>>> <mailto:[hidden email] <mailto:[hidden email]>>> > > >>> wrote: > > >>> >>>> > > >>> >>>> So yes, the Jenkins job keeps pulling the state from > > >>> Travis until it > > >>> >>>> finishes. > > >>> >>>> > > >>> >>>> Note sure I'm comfortable with the idea of using > Jenkins > > >>> workers > > >>> >>>> just to > > >>> >>>> idle for a several hours. > > >>> >>>> > > >>> >>>> On 29/06/2019 14:56, Jeff Zhang wrote: > > >>> >>>> > Here's what zeppelin community did, we make a python > > >>> script to > > >>> >>>> check the > > >>> >>>> > build status of pull request. > > >>> >>>> > Here's script: > > >>> >>>> > > > >>> https://github.com/apache/zeppelin/blob/master/travis_check.py > > >>> >>>> > > > >>> >>>> > And this is the script we used in Jenkins build job. 
> > >>> >>>> >
> > >>> >>>> > if [ -f "travis_check.py" ]; then
> > >>> >>>> >   git log -n 1
> > >>> >>>> >   STATUS=$(curl -s $BUILD_URL | grep -e "GitHub pull request.*from.*" | sed 's/.*GitHub pull request <a href=\"\(https[^"]*\).*from[^"]*.\(https[^"]*\).*/\1 \2/g')
> > >>> >>>> >   AUTHOR=$(echo $STATUS | sed 's/.*[/]\(.*\)$/\1/g')
> > >>> >>>> >   PR=$(echo $STATUS | awk '{print $1}' | sed 's/.*[/]\(.*\)$/\1/g')
> > >>> >>>> >   #COMMIT=$(git log -n 1 | grep "^Merge:" | awk '{print $3}')
> > >>> >>>> >   #if [ -z $COMMIT ]; then
> > >>> >>>> >   #  COMMIT=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR | grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | tr '\n' ' ' | sed 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' | grep -v "apache:" | sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g')
> > >>> >>>> >   #fi
> > >>> >>>> >
> > >>> >>>> >   # get commit hash from PR
> > >>> >>>> >   COMMIT=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR | grep -e "\"label\":" -e "\"ref\":" -e "\"sha\":" | tr '\n' ' ' | sed 's/\(.*sha[^,]*,\)\(.*ref.*\)/\1 = \2/g' | tr = '\n' | grep -v "apache:" | sed 's/.*sha.[^"]*["]\([^"]*\).*/\1/g')
> > >>> >>>> >   sleep 30 # sleep few moment to wait travis starts the build
> > >>> >>>> >   RET_CODE=0
> > >>> >>>> >   python ./travis_check.py ${AUTHOR} ${COMMIT} || RET_CODE=$?
> > >>> >>>> >   if [ $RET_CODE -eq 2 ]; then # try with repository name when travis-ci is not available in the account
> > >>> >>>> >     RET_CODE=0
> > >>> >>>> >     AUTHOR=$(curl -s https://api.github.com/repos/apache/zeppelin/pulls/$PR | grep '"full_name":' | grep -v "apache/zeppelin" | sed 's/.*[:][^"]*["]\([^/]*\).*/\1/g')
> > >>> >>>> >     python ./travis_check.py ${AUTHOR} ${COMMIT} || RET_CODE=$?
> > >>> >>>> >   fi
> > >>> >>>> >
> > >>> >>>> >   if [ $RET_CODE -eq 2 ]; then # fail with can't find build information in the travis
> > >>> >>>> >     set +x
> > >>> >>>> >     echo "-----------------------------------------------------"
> > >>> >>>> >     echo "Looks like travis-ci is not configured for your fork."
> > >>> >>>> >     echo "Please setup by swich on 'zeppelin' repository at https://travis-ci.org/profile and travis-ci."
> > >>> >>>> >     echo "And then make sure 'Build branch updates' option is enabled in the settings https://travis-ci.org/${AUTHOR}/zeppelin/settings."
> > >>> >>>> >     echo ""
> > >>> >>>> >     echo "To trigger CI after setup, you will need ammend your last commit with"
> > >>> >>>> >     echo "git commit --amend"
> > >>> >>>> >     echo "git push your-remote HEAD --force"
> > >>> >>>> >     echo ""
> > >>> >>>> >     echo "See http://zeppelin.apache.org/contribution/contributions.html#continuous-integration."
> > >>> >>>> > fi > > >>> >>>> > > > >>> >>>> > exit $RET_CODE > > >>> >>>> > else > > >>> >>>> > set +x > > >>> >>>> > echo "travis_check.py does not exists" > > >>> >>>> > exit 1 > > >>> >>>> > fi > > >>> >>>> > > > >>> >>>> > Chesnay Schepler <[hidden email] > > >>> <mailto:[hidden email]> > > >>> >>>> <mailto:[hidden email] <mailto:[hidden email] > >>> > > >>> 于2019年6月29日周六 下午3:17写道: > > >>> >>>> > > > >>> >>>> >> Does this imply that a Jenkins job is active as long > > >>> as the > > >>> >>>> Travis build > > >>> >>>> >> runs? > > >>> >>>> >> > > >>> >>>> >> On 26/06/2019 21:28, Bowen Li wrote: > > >>> >>>> >>> Hi, > > >>> >>>> >>> > > >>> >>>> >>> @Dawid, I think the "long test running" as I > > >>> mentioned in the > > >>> >>>> first > > >>> >>>> >> email, > > >>> >>>> >>> also as you guys said, belongs to "a big effort > > >>> which is much > > >>> >>>> harder to > > >>> >>>> >>> accomplish in a short period of time and may > deserve > > >>> its own > > >>> >>>> separate > > >>> >>>> >>> discussion". Thus I didn't include it in what we > can > > >>> do in a > > >>> >>>> foreseeable > > >>> >>>> >>> short term. > > >>> >>>> >>> > > >>> >>>> >>> Besides, I don't think that's the ultimate reason > > >>> for lack of > > >>> >>>> build > > >>> >>>> >>> resources. Even if the build is shortened to > > >>> something like > > >>> >>>> 2h, the > > >>> >>>> >>> problems of no build machine works about 6 or more > > >>> hours in > > >>> >>>> PST daytime > > >>> >>>> >>> that I described will still happen, because no > > >>> machine from > > >>> >>>> ASF INFRA's > > >>> >>>> >>> pool is allocated to Flink. As I have paid close > > >>> attention to > > >>> >>>> the build > > >>> >>>> >>> queue in the past few weekdays, it's a pretty clear > > >>> pattern now. > > >>> >>>> >>> > > >>> >>>> >>> **The ultimate root cause** for that is - we don't > > >>> have any > > >>> >>>> **dedicated** > > >>> >>>> >>> build resources that we can stably rely on. 
I'm > > >>> actually ok to > > >>> >>>> wait for a > > >>> >>>> >>> long time if there are build requests running, it > > >>> means at > > >>> >>>> least we are > > >>> >>>> >>> making progress. But I'm not ok with no build > > >>> resource. A > > >>> >>>> better place I > > >>> >>>> >>> think we should aim at in short term is to always > > >>> have at > > >>> >>>> least a central > > >>> >>>> >>> pool (can be 3 or 5) of machines dedicated to build > > >>> Flink at > > >>> >>>> any time, or > > >>> >>>> >>> maybe use users resources. > > >>> >>>> >>> > > >>> >>>> >>> @Chesnay @Robert I synced with Jeff offline that > > >>> Zeppelin > > >>> >>>> community is > > >>> >>>> >>> using a Jenkins job to automatically build on > users' > > >>> travis > > >>> >>>> account and > > >>> >>>> >>> link the result back to github PR. I guess the > > >>> Jenkins job > > >>> >>>> would fetch > > >>> >>>> >>> latest upstream master and build the PR against it. > > >>> Jeff has > > >>> >>>> filed > > >>> >>>> >> tickets > > >>> >>>> >>> to learn and get access to the Jenkins infra. It'll > > >>> better to > > >>> >>>> fully > > >>> >>>> >>> understand it first before judging this approach. > > >>> >>>> >>> > > >>> >>>> >>> I also heard good things about CircleCI, and ASF > > >>> INFRA seems > > >>> >>>> to have a > > >>> >>>> >> pool > > >>> >>>> >>> of build capacity there too. Can be an alternative > > >>> to consider. > > >>> >>>> >>> > > >>> >>>> >>> > > >>> >>>> >>> > > >>> >>>> >>> > > >>> >>>> >>> > > >>> >>>> >>> > > >>> >>>> >>> > > >>> >>>> >>> > > >>> >>>> >>> > > >>> >>>> >>> On Wed, Jun 26, 2019 at 12:44 AM Dawid Wysakowicz < > > >>> >>>> >> [hidden email] > > >>> <mailto:[hidden email]> <mailto:[hidden email] > > >>> <mailto:[hidden email]>>> > > >>> >>>> >>> wrote: > > >>> >>>> >>> > > >>> >>>> >>>> Sorry to jump in late, but I think Bowen missed > the > > >>> most > > >>> >>>> important point > > >>> >>>> >>>> from Chesnay's previous message in the summary. 
> The > > >>> ultimate > > >>> >>>> reason for > > >>> >>>> >>>> all the problems is that the tests take close to 2 > > >>> hours to > > >>> >>>> run already. > > >>> >>>> >>>> I fully support this claim: "Unless people start > > >>> caring about > > >>> >>>> test times > > >>> >>>> >>>> before adding them, this issue cannot be solved" > > >>> >>>> >>>> > > >>> >>>> >>>> This is also another reason why using user's > Travis > > >>> account > > >>> >>>> won't help. > > >>> >>>> >>>> Every few weeks we reach the user's time limit for > > >>> a single > > >>> >>>> profile. > > >>> >>>> >>>> This makes the user's builds simply fail, until we > > >>> either > > >>> >>>> properly > > >>> >>>> >>>> decrease the time the tests take (which I am not > > >>> sure we ever > > >>> >>>> did) or > > >>> >>>> >>>> postpone the problem by splitting into more > > >>> profiles. (Note > > >>> >>>> that the ASF > > >>> >>>> >>>> Travis account has higher time limits) > > >>> >>>> >>>> > > >>> >>>> >>>> Best, > > >>> >>>> >>>> > > >>> >>>> >>>> Dawid > > >>> >>>> >>>> > > >>> >>>> >>>> On 26/06/2019 09:36, Robert Metzger wrote: > > >>> >>>> >>>>> Do we know if using "the best" available hardware > > >>> would > > >>> >>>> improve the > > >>> >>>> >> build > > >>> >>>> >>>>> times? > > >>> >>>> >>>>> Imagine we would run the build on machines with > > >>> plenty of > > >>> >>>> main memory > > >>> >>>> >> to > > >>> >>>> >>>>> mount everything to ramdisk + the latest CPU > > >>> architecture? 
> > >>> >>>> >>>>> Throwing hardware at the problem could help reduce the time of an individual build, and using our own infrastructure would remove our dependency on Apache's Travis account (with the obvious downside of having to maintain the infrastructure).
> > >>> >>>> >>>>> We could use an open source Travis alternative, to have a similar experience and make the migration easy.
> > >>> >>>> >>>>>
> > >>> >>>> >>>>> On Wed, Jun 26, 2019 at 9:34 AM Chesnay Schepler <[hidden email]> wrote:
> > >>> >>>> >>>>>> From what I gathered, there's no special sauce that the Zeppelin project uses which actually integrates a user's Travis account into the PR. They just disabled Travis for PRs. And that's kind of it.
> > >>> >>>> >>>>>>
> > >>> >>>> >>>>>> Naturally we can do this (duh) and save the ASF a fair amount of resources, but there are downsides:
> > >>> >>>> >>>>>>
> > >>> >>>> >>>>>> The discoverability of the Travis check takes a nose-dive. Either we require every contributor to always, on every commit, also post a Travis build, or we have the reviewer sift through the contributor's account to find it.
> > >>> >>>> >>>>>>
> > >>> >>>> >>>>>> This is rather cumbersome. Additionally, it's also not equivalent to having a PR build.
> > >>> >>>> >>>>>>
> > >>> >>>> >>>>>> A normal branch build takes a branch as is and tests it. A PR build merges the branch into master, and then runs it. (Fun fact: This is why a PR with merge conflicts is not being run on Travis.)
> > >>> >>>> >>>>>>
> > >>> >>>> >>>>>> And ultimately, everyone can already make use of this approach anyway.
> > >>> >>>> >>>>>>
> > >>> >>>> >>>>>> On 25/06/2019 08:02, Jark Wu wrote:
> > >>> >>>> >>>>>>> Hi Jeff,
> > >>> >>>> >>>>>>>
> > >>> >>>> >>>>>>> Thanks for sharing the Zeppelin approach. I think it's a good idea to leverage the user's Travis account.
> > >>> >>>> >>>>>>> In this way, we can have almost unlimited concurrent build jobs and developers can restart builds by themselves (currently only committers can restart a PR's build).
> > >>> >>>> >>>>>>>
> > >>> >>>> >>>>>>> But I'm still not very clear on how to integrate the user's Travis build into the Flink pull request's build automatically. Can you explain in more detail?
> > >>> >>>> >>>>>>>
> > >>> >>>> >>>>>>> Another question: does Travis only build branches for a user account?
> > >>> >>>> >>>>>>> My concern is that builds for PRs will rebase the user's commits against the current master branch.
> > >>> >>>> >>>>>>> This will help us to find problems before merge. Builds for branches will lose the impact of new commits in master.
> > >>> >>>> >>>>>>> How does Zeppelin solve this problem?
> > >>> >>>> >>>>>>>
> > >>> >>>> >>>>>>> Thanks again for sharing the idea.
> > >>> >>>> >>>>>>>
> > >>> >>>> >>>>>>> Regards,
> > >>> >>>> >>>>>>> Jark
> > >>> >>>> >>>>>>>
> > >>> >>>> >>>>>>> On Tue, 25 Jun 2019 at 11:01, Jeff Zhang <[hidden email]> wrote:
> > >>> >>>> >>>>>>>
> > >>> >>>> >>>>>>>     Hi Folks,
> > >>> >>>> >>>>>>>
> > >>> >>>> >>>>>>>     Zeppelin met this kind of issue before; we solved it by delegating each one's PR build to his Travis account (everyone can have 5 free slots for Travis builds).
> > >>> >>>> >>>>>>>     The Apache account's Travis build is only triggered when a PR is merged.
> > >>> >>>> >>>>>>>
> > >>> >>>> >>>>>>>     Kurt Young <[hidden email]> wrote on Tue, Jun 25, 2019 at 10:16 AM:
> > >>> >>>> >>>>>>>
> > >>> >>>> >>>>>>>     > (Forgot to cc George)
> > >>> >>>> >>>>>>>     >
> > >>> >>>> >>>>>>>     > Best,
> > >>> >>>> >>>>>>>     > Kurt
> > >>> >>>> >>>>>>>     >
> > >>> >>>> >>>>>>>     > On Tue, Jun 25, 2019 at 10:16 AM Kurt Young <[hidden email]> wrote:
> > >>> >>>> >>>>>>>     >
> > >>> >>>> >>>>>>>     > > Hi Bowen,
> > >>> >>>> >>>>>>>     > >
> > >>> >>>> >>>>>>>     > > Thanks for bringing this up. We have actually discussed this, and I think Till and George have already spent some time investigating it. I have cced both of them, and maybe they can share their findings.
> > >>> >>>> >>>>>>>     > >
> > >>> >>>> >>>>>>>     > > Best,
> > >>> >>>> >>>>>>>     > > Kurt
> > >>> >>>> >>>>>>>     > >
> > >>> >>>> >>>>>>>     > > On Tue, Jun 25, 2019 at 10:08 AM Jark Wu <[hidden email]> wrote:
> > >>> >>>> >>>>>>>     > >
> > >>> >>>> >>>>>>>     > >> Hi Bowen,
> > >>> >>>> >>>>>>>     > >>
> > >>> >>>> >>>>>>>     > >> Thanks for bringing this up. We also suffered from the long build time.
> > >>> >>>> >>>>>>>     > >> I agree that we should focus on solving the build capacity problem in this thread.
> > >>> >>>> >>>>>>>     > >>
> > >>> >>>> >>>>>>>     > >> My observation is that only one build is running; all the others (other PRs, master) are pending.
> > >>> >>>> >>>>>>>     > >> The pricing plan [1] of Travis shows it can support concurrent build jobs.
> > >>> >>>> >>>>>>>     > >> But I don't know which plan we are using; it might be the free plan for open source.
> > >>> >>>> >>>>>>>     > >>
> > >>> >>>> >>>>>>>     > >> I cc-ed Chesnay who may have some experience with Travis.
> > >>> >>>> >>>>>>>     > >>
> > >>> >>>> >>>>>>>     > >> Regards,
> > >>> >>>> >>>>>>>     > >> Jark
> > >>> >>>> >>>>>>>     > >>
> > >>> >>>> >>>>>>>     > >> [1]: https://travis-ci.com/plans
> > >>> >>>> >>>>>>>     > >>
> > >>> >>>> >>>>>>>     > >> On Tue, 25 Jun 2019 at 08:11, Bowen Li <[hidden email]> wrote:
> > >>> >>>> >>>>>>>     > >>
> > >>> >>>> >>>>>>>     > >> > Hi Steven,
> > >>> >>>> >>>>>>>     > >> >
> > >>> >>>> >>>>>>>     > >> > I think you may not have read what I wrote. The discussion is about "unstable build **capacity**", in other words "unstable / lack of build resources", not "unstable build".
> > >>> >>>> >>>>>>>     > >> >
> > >>> >>>> >>>>>>>     > >> > On Mon, Jun 24, 2019 at 4:40 PM Steven Wu <[hidden email]> wrote:
> > >>> >>>> >>>>>>>     > >> >
> > >>> >>>> >>>>>>>     > >> > > Long and sometimes unstable builds are definitely a pain point.
> > >>> >>>> >>>>>>>     > >> > >
> > >>> >>>> >>>>>>>     > >> > > I suspect the build failure here in flink-connector-kafka is not related to my change, but there is no easy way to re-run the build in the Travis UI. A search showed a trick: closing and reopening the PR will trigger a rebuild, but that could add noise to the PR activity.
> > >>> >>>> >>>>>>>     > >> > > https://travis-ci.org/apache/flink/jobs/545555519
> > >>> >>>> >>>>>>>     > >> > >
> > >>> >>>> >>>>>>>     > >> > > travis-ci for my personal repo often failed by exceeding the time limit after 4+ hours:
> > >>> >>>> >>>>>>>     > >> > > The job exceeded the maximum time limit for jobs, and has been terminated.
> > >>> >>>> >>>>>>>     > >> > >
> > >>> >>>> >>>>>>>     > >> > > On Mon, Jun 24, 2019 at 4:15 PM Bowen Li <[hidden email]> wrote:
> > >>> >>>> >>>>>>>     > >> > >
> > >>> >>>> >>>>>>>     > >> > > > https://travis-ci.org/apache/flink/builds/549681530 This build request has been sitting at **HEAD of the queue** since I first saw it at PST 10:30am (not sure how long it had been there before 10:30am). It's PST 4:12pm now and it hasn't started yet.
> > >>> >>>> >>>>>>>     > >> > > >
> > >>> >>>> >>>>>>>     > >> > > > On Mon, Jun 24, 2019 at 2:48 PM Bowen Li <[hidden email]> wrote:
> > >>> >>>> >>>>>>>     > >> > > >
> > >>> >>>> >>>>>>>     > >> > > > > Hi devs,
> > >>> >>>> >>>>>>>     > >> > > > >
> > >>> >>>> >>>>>>>     > >> > > > > I've been experiencing the pain resulting from the lack of stable build capacity on Travis for Flink PRs [1]. Specifically, I often noticed that no build in the queue makes any progress for hours, and then suddenly 5 or 6 builds kick off all together after the long pause. I'm in the PST (UTC-08) time zone, and I've seen the pause last as long as 6 hours, from PST 9am to 3pm (let alone the time needed to drain the queue afterwards).
> > >>> >>>> >>>>>>>     > >> > > > >
> > >>> >>>> >>>>>>>     > >> > > > > I think this has greatly impacted our productivity. I've experienced that PRs submitted in the early morning PST won't finish their build until late at night on the same day.
> > >>> >>>> >>>>>>>     > >> > > > >
> > >>> >>>> >>>>>>>     > >> > > > > So my questions are:
> > >>> >>>> >>>>>>>     > >> > > > >
> > >>> >>>> >>>>>>>     > >> > > > > - Has anyone else experienced the same problem or made a similar observation on TravisCI? (I suspect it has something to do with the time zone.)
> > >>> >>>> >>>>>>>     > >> > > > >
> > >>> >>>> >>>>>>>     > >> > > > > - What pricing plan of TravisCI is Flink currently using? Is it the free plan for open source projects? What is the guaranteed build capacity of the current plan?
> > >>> >>>> >>>>>>>     > >> > > > >
> > >>> >>>> >>>>>>>     > >> > > > > - If the current pricing plan (either free or paid) can't provide stable build capacity, can we upgrade to a higher priced plan with larger and more stable build capacity?
> > >>> >>>> >>>>>>>     > >> > > > >
> > >>> >>>> >>>>>>>     > >> > > > > BTW, another factor that contributes to the productivity problem is that our build is slow - we run a full build for every PR and a successful full build takes ~5h. We definitely have more options to solve it, for instance, modularizing the build graph and reusing artifacts from the previous build. But I think that can be a big effort which is much harder to accomplish in a short period of time and may deserve its own separate discussion.
> > >>> >>>> >>>>>>>     > >> > > > >
> > >>> >>>> >>>>>>>     > >> > > > > [1] https://travis-ci.org/apache/flink/pull_requests
> > >>> >>>> >>>>>>>
> > >>> >>>> >>>>>>> --
> > >>> >>>> >>>>>>> Best Regards
> > >>> >>>> >>>>>>>
> > >>> >>>> >>>>>>> Jeff Zhang