Hi all!

As part of the Flink on ARM effort, there is a pull request that triggers a build on OpenLab CI for each push and runs tests on ARM machines.

Currently that build is roughly equivalent to what the "core" and "tests" profiles do on Travis. The result will be posted to the PR comments, similar to the Flink Bot's Travis build result. The build currently passes :-) so Flink seems to be okay on ARM.

My suggestion would be to try and add this and gather some experience with it. The Travis build results should be our "ground truth" and the ARM CI (OpenLab CI) would be "informational only" at the beginning, while helping us understand when we break ARM support.

You can see this in the PR that adds the OpenLab CI config: https://github.com/apache/flink/pull/9416

Any objections?

Best,
Stephan
|
I'm wondering what we are supposed to do if the build fails? We aren't providing any guides on setting up an ARM dev environment, so reproducing it locally isn't possible.

On 23/08/2019 17:55, Stephan Ewen wrote:
> [...]
> My suggestion would be to try and add this and gather some experience with it.
> [...]
> Any objections?
|
Thanks to Stephan for bringing up this topic.

The package build jobs work well now. I have a simple online demo that was built and runs on an ARM VM. Feel free to try it [1].

As the first step for ARM support, maybe it's good to add them now.

As for the next step, the test part is still broken. It relates to a few points we found:

1. Some unit tests fail [2] because of the Java code. These kinds of failures can be fixed easily.
2. Some tests fail because they depend on third-party libraries [3], including frocksdb, MapR Client and Netty, which don't have ARM releases.
   a. Frocksdb: I'm testing it locally now with `make check_some` and `make jtest`, similar to its Travis job. 3 tests fail under `make check_some`. Please see the ticket for more details. Once the tests pass, frocksdb can release an ARM package.
   b. MapR Client: This belongs to the MapR company. For now, maybe we should skip MapR support for Flink on ARM.
   c. Netty: Netty actually runs well on our ARM machine. We will ask the Netty community to release ARM support. If they do not want to, OpenLab will host a Maven repository for some common libraries on ARM.

For Chesnay's concern:

Firstly, the OpenLab team will keep maintaining and fixing the ARM CI. That means once a build or test fails, we'll fix it at once.
Secondly, OpenLab can provide ARM VMs to everyone for reproducing and testing. You just need to create a Test Request issue in openlab [4]. Then we'll create ARM VMs for you, and you can log in and do whatever you need.

Does that make sense?

[1]: http://114.115.168.52:8081/#/overview
[2]: https://issues.apache.org/jira/browse/FLINK-13449 https://issues.apache.org/jira/browse/FLINK-13450
[3]: https://issues.apache.org/jira/browse/FLINK-13598
[4]: https://github.com/theopenlab/openlab/issues/new/choose

Chesnay Schepler <[hidden email]> wrote on Sat, Aug 24, 2019 at 12:10 AM:
> I'm wondering what we are supposed to do if the build fails?
> We aren't providing any guides on setting up an ARM dev environment, so reproducing it locally isn't possible.
|
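One way to see whether a dependency jar ships ARM native code is to list the native libraries inside it. The following is only a rough sketch in Python: the helper names `native_libs` and `lacks_arm_native` are made up for illustration, and the assumption that ARM64 builds carry "aarch64", "aarch_64" or "arm64" in the file name is a naming convention, not a guarantee for any particular library.

```python
import zipfile

# Assumption: substrings that commonly mark 64-bit ARM builds in native library names.
ARM_MARKERS = ("aarch64", "aarch_64", "arm64")
# File suffixes treated as native libraries.
NATIVE_SUFFIXES = (".so", ".dylib", ".dll", ".jnilib")


def native_libs(jar_path):
    """Return the names of all native library entries inside a jar."""
    with zipfile.ZipFile(jar_path) as jar:
        return [name for name in jar.namelist()
                if name.lower().endswith(NATIVE_SUFFIXES)]


def lacks_arm_native(jar_path):
    """True if the jar bundles native libraries but none that looks like an ARM64 build."""
    libs = native_libs(jar_path)
    if not libs:
        # Pure-Java jar: nothing architecture specific to worry about.
        return False
    return not any(marker in lib.lower()
                   for lib in libs for marker in ARM_MARKERS)
```

Calling `lacks_arm_native("path/to/some.jar")` on each dependency would give a first list of candidates to look at; a name match does not prove the library actually works on ARM.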
I'm sorry, but if these issues are only fixed later anyway I see no reason to run these tests on each PR. We're just adding noise to each PR that everyone will just ignore.

I'm curious as to the benefit of having this directly in Flink; why aren't the ARM builds run outside of the Flink project, and fixes for them provided?

It seems to me like nothing about these ARM builds is actually handled by the Flink project.

On 26/08/2019 03:43, Xiyuan Wang wrote:
> [...]
> As for the next step, the test part is still broken.
> [...]
|
Sorry, maybe my wording was misleading.

We are just starting to add ARM support, so the CI is non-voting at this moment to avoid blocking normal Flink development.

But once the ARM CI works well and is stable enough, we should mark it as voting. That means that in the future, if the ARM test fails in a PR, the PR cannot be merged. The test log should tell developers what the error is. If a developer needs to debug the details on an ARM VM, OpenLab can provide one.

Adding ARM CI can make sure Flink supports ARM natively.

I left a workflow in the PR, and I'd like to repeat it here:

1. Add the basic build script to ensure the CI system and build job work as expected. The job should be marked as non-voting first, which means a CI test failure won't block a Flink PR from being merged.
2. Add the test script to run the unit/integration tests. At this step the `-fn` (`--fail-never`) parameter will be added to `mvn test`. It will run the full set of test cases in Flink, so that we can find which tests fail on ARM.
3. Fix the test failures one by one.
4. Once all the tests pass, remove the `-fn` parameter and keep watching the CI's status for some days. If bugs come up, fix them as we usually do for travis-ci.
5. Once the CI is stable enough, remove the non-voting tag, so that the ARM CI will be the same as travis-ci, one of the gates for Flink PRs.
6. Finally, the Flink community can announce and release the Flink ARM version.

Chesnay Schepler <[hidden email]> wrote on Mon, Aug 26, 2019 at 2:25 PM:
> I'm sorry, but if these issues are only fixed later anyway I see no reason to run these tests on each PR. We're just adding noise to each PR that everyone will just ignore.
> [...]
|
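For steps 2 and 3 of the workflow, a build that runs with `mvn test -fn` does not stop at the first failing module, so the failures have to be collected afterwards. Below is a minimal Python sketch of one way to do that, assuming the tests run through Maven Surefire and write its default XML reports to `target/surefire-reports/TEST-*.xml`. The function name `failed_tests` is hypothetical and not part of any existing tooling.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def failed_tests(root_dir):
    """Collect (class name, test name, kind) for every failed or errored test case.

    Searches root_dir recursively for Surefire XML reports. 'kind' is
    "failure" for assertion failures and "error" for unexpected exceptions.
    """
    failures = []
    for report in sorted(Path(root_dir).rglob("surefire-reports/TEST-*.xml")):
        suite = ET.parse(report).getroot()
        for case in suite.iter("testcase"):
            for kind in ("failure", "error"):
                if case.find(kind) is not None:
                    failures.append((case.get("classname"), case.get("name"), kind))
                    break
    return failures
```

Running `failed_tests(".")` from the root of the checkout after the build would give the list of tests to fix one by one; an empty list suggests `-fn` can be dropped again.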
Adding CI builds for ARM only makes sense when we actually take them into account as "blocking a merge", otherwise there is no point in having them. So we would need to be prepared to do that.

The cases where something runs on UNIX/x64 but fails on ARM are few, and so far seem to have been related to libraries or some magic that tries to do system-dependent actions outside Java.

One worthwhile discussion could be whether to run the ARM CI builds as part of the nightly tests, not on every commit. There are a lot of nightly tests, for example for different Java / Scala / Hadoop versions.

On Mon, Aug 26, 2019 at 10:46 AM Xiyuan Wang <[hidden email]> wrote:
> [...]
> But once the ARM CI works well and is stable enough, we should mark it as voting.
> [...]
|
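The difference between an informational (non-voting) check and a merge-blocking (voting) one can be written down in a few lines. This is only an illustrative Python sketch of the idea discussed in this thread: `CheckResult`, `can_merge` and `informational_failures` are hypothetical names, not how Travis, Zuul or GitHub implement gating.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    name: str      # e.g. "travis" or "arm"
    passed: bool   # did the CI job succeed?
    voting: bool   # does a failure block the merge?


def can_merge(results):
    """A PR may be merged when every voting check passed."""
    return all(r.passed for r in results if r.voting)


def informational_failures(results):
    """Names of failed non-voting checks: visible, but not blocking."""
    return [r.name for r in results if not r.voting and not r.passed]
```

With the ARM job marked non-voting, a red ARM build shows up in `informational_failures` while `can_merge` still returns true; flipping `voting` to true turns the same failure into a gate.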
Until the ARM CI is ready, I can turn off the CI test for each PR and let it be triggered only by a PR comment. It's quite easy for OpenLab to do this.

OpenLab has many job pipelines [1]. Right now I use the `check` pipeline in https://github.com/apache/flink/pull/9416. Its job trigger contains github_action and github_comment [2]. I can create a new pipeline for Flink whose trigger contains only github_comment, like:

    trigger:
      github:
        - event: pull_request
          action: comment
          comment: (?i)^\s*recheck_arm_build\s*$

That way the ARM job will not run for every PR. It will only run for PRs that have a `recheck_arm_build` comment.

Then once the ARM CI is ready, I can add it back.

Nightly tests can be added as well, of course. There is a kind of job in OpenLab called a `periodic job`. We can use it for Flink's nightly tests. If any error occurs, the report can be sent to [hidden email] as well.

[1]: https://github.com/theopenlab/openlab-zuul-jobs/blob/master/zuul.d/pipelines.yaml
[2]: https://github.com/theopenlab/openlab-zuul-jobs/blob/master/zuul.d/pipelines.yaml#L10-L19

Stephan Ewen <[hidden email]> wrote on Mon, Aug 26, 2019 at 6:13 PM:
> [...]
> One worthwhile discussion could be whether to run the ARM CI builds as part of the nightly tests, not on every commit.
> [...]
|
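To make it concrete which PR comments the pattern `(?i)^\s*recheck_arm_build\s*$` would accept, here is a small Python sketch. It assumes the CI evaluates the pattern with Python-style regular expression semantics; the helper name `triggers_arm_build` is made up for this example.

```python
import re

# The comment pattern from the trigger configuration above.
# (?i) makes it case-insensitive; ^\s* and \s*$ allow surrounding
# whitespace but nothing else in the comment.
TRIGGER = re.compile(r"(?i)^\s*recheck_arm_build\s*$")


def triggers_arm_build(comment):
    """True if a PR comment consists only of the trigger phrase."""
    return TRIGGER.search(comment) is not None
```

Under that assumption, `triggers_arm_build("  Recheck_ARM_Build \n")` is true, while a comment such as "please recheck_arm_build" is not, because the phrase has to be the whole comment.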
The ARM CI trigger has been changed to the `github comment` way only. This means that no PR will start the ARM test unless a `check_arm` comment is added, like what I did in the PR [1].

A POC for the Flink nightly end-to-end test job has been created as well [2]. I'll keep improving it.

Any feedback or questions?

[1]: https://github.com/apache/flink/pull/9416 https://github.com/apache/flink/pull/9416#issuecomment-527268203
[2]: https://github.com/theopenlab/openlab-zuul-jobs/pull/631

Thanks

Xiyuan Wang <[hidden email]> wrote on Mon, Aug 26, 2019 at 7:41 PM:
> Before ARM CI is ready, I can close the CI test for each PR and let it only be triggered by PR comment. It's quite easy for OpenLab to do this.
> [...]
|
My gut feeling is that a CI that only runs on a specific command will not help much.

What about going with nightly builds instead? We could set up the ARM CI the same way as the Travis CI nightly builds (cron builds). They report build failures to "[hidden email]". Maybe Chesnay or Jark could help with what needs to be done to post to that mailing list?

A requirement would be that the builds are stable from the ARM perspective, meaning that there are currently no failures caused by ARM-specific issues.

What do the others think? |
Sure, we can first run the ARM job daily, like the Travis CI nightly jobs. Once it's stable enough, we can consider adding it to every PR.

BTW, I tested flink-end-to-end-test on ARM over the last few days. To match Travis, all 7 scenarios were tested:

1. split_checkpoints.sh
2. split_sticky.sh
3. split_ha.sh
4. split_heavy.sh
5. split_misc_hadoopfree.sh
6. split_misc.sh
7. split_container.sh

Scenarios 1-6 work well after some local hacking and bug fixing:
1. frocksdb doesn't have an official ARM release, so I built and installed it locally for ARM. https://issues.apache.org/jira/browse/FLINK-13598
2. Prometheus has an ARM release, but the test always downloads the x86 version. Downloading the correct version fixes the issue. https://issues.apache.org/jira/browse/FLINK-14086
3. Elasticsearch 6.0+ enables the X-Pack machine learning feature by default, but this feature doesn't support ARM, so Elasticsearch 6.0+ fails to start on ARM. Setting `xpack.ml.enabled: false` fixes this issue. https://issues.apache.org/jira/browse/FLINK-14126

The 7th scenario (container) failed because:
1. docker-compose doesn't have an official ARM package. Installing it with `apt install docker-compose` solves the problem.
2. minikube doesn't support the ARM architecture. Using kubeadm for the K8s installation solves the problem.

Fixing the problems mentioned above is not hard, so I think we can add the Flink build, unit tests and e2e tests as nightly jobs now.

Any thoughts?

Thanks. |
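For illustration, the Prometheus and Elasticsearch fixes from the message above could look roughly like the shell sketch below. It is not the actual change made for FLINK-14086 or FLINK-14126: the function names, the default version in `PROMETHEUS_VERSION`, and the `uname -m` to release-suffix mapping are assumptions made for this example.

```shell
#!/usr/bin/env bash
# Sketch only: pick test dependencies that match the host CPU.
# All helper names and the default version are illustrative assumptions,
# not taken from the Flink end-to-end test scripts.

PROMETHEUS_VERSION="${PROMETHEUS_VERSION:-2.4.3}"

# Map a `uname -m` machine name to the architecture suffix used in
# Prometheus release archive names.
prometheus_arch() {
  case "$1" in
    x86_64|amd64)  echo "amd64" ;;
    aarch64|arm64) echo "arm64" ;;
    armv7l)        echo "armv7" ;;
    *)             echo "unsupported machine type: $1" >&2; return 1 ;;
  esac
}

# Build the archive name for a machine type (defaults to this host),
# instead of hard-coding the x86 archive.
prometheus_archive() {
  local machine="${1:-$(uname -m)}"
  local arch
  arch="$(prometheus_arch "$machine")" || return 1
  echo "prometheus-${PROMETHEUS_VERSION}.linux-${arch}.tar.gz"
}

# Elasticsearch 6.0+: print the extra elasticsearch.yml line that turns
# off X-Pack machine learning on ARM; print nothing on other machines.
elasticsearch_arm_setting() {
  local machine="${1:-$(uname -m)}"
  case "$machine" in
    aarch64|arm64) echo "xpack.ml.enabled: false" ;;
  esac
}
```

A test script could then call `prometheus_archive` with no argument to get the archive name for the current host, and append the output of `elasticsearch_arm_setting` to the Elasticsearch config file before starting the node.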
This sounds good, Xiyuan. I'd also be in favour of running the ARM builds regularly as cron jobs, and once we see that they are stable we could run them for every master commit.

Hence, I'd say let's fix the above-mentioned problems and then set up the nightly cron job.

Cheers,
Till |
Hi Till,

Thanks for your response. All ARM-related work is tracked under https://issues.apache.org/jira/browse/FLINK-13448, and I have already created some PRs.

After some local hacking, the E2E tests now run well, and I have already added them to OpenLab. The POC log: http://status.openlabtesting.org/builds?project=apache%2Fflink&pipeline=periodic-20
The job runs at 20:00 UTC every day. Following the POC, I have also created the official PR for the cron job, which covers the core/test-related module tests and the e2e tests (except the container ones): https://github.com/apache/flink/pull/9416

Once it's merged, I can configure OpenLab to send the test results to [hidden email] every day.

Thanks.

Till Rohrmann <[hidden email]> wrote on Mon, Sep 23, 2019 at 8:40 PM:

> This sounds good Xiyuan. I'd also be in favour of running the ARM builds
> regularly as cron jobs and once we see that they are stable we could run
> them for every master commit. Hence, I'd say let's fix the above mentioned
> problems and then set the nightly cron job up.
>
> Cheers,
> Till
|
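As a small illustration of the daily 20:00 UTC schedule mentioned above, here is a minimal Python sketch that computes when the next nightly run would start. It is only an illustration and not the OpenLab or Zuul configuration; the function name `next_nightly_run` is hypothetical. The schedule corresponds to the standard cron expression `0 20 * * *` evaluated in UTC.

```python
from datetime import datetime, time, timedelta, timezone

# Hour (UTC) at which the nightly ARM job is said to run in the message above.
NIGHTLY_HOUR_UTC = 20


def next_nightly_run(now: datetime) -> datetime:
    """Return the next 20:00 UTC instant strictly after `now`.

    `now` must be timezone-aware; it is converted to UTC first.
    (Hypothetical helper, for illustration only.)
    """
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    now_utc = now.astimezone(timezone.utc)
    # Today's 20:00 UTC slot.
    candidate = datetime.combine(
        now_utc.date(), time(NIGHTLY_HOUR_UTC), tzinfo=timezone.utc
    )
    # If that slot has already started or passed, the next run is tomorrow.
    if candidate <= now_utc:
        candidate += timedelta(days=1)
    return candidate
```

For example, called with a timestamp from the morning of a given day (UTC), it returns 20:00 UTC of that same day; called at exactly 20:00 UTC, it returns the slot on the following day.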
Hi Flink team,

Based on the discussion, I assume we now agree on running the ARM tests as a cron job for the time being. I have been running POC e2e tests in OpenLab for some days [1]. They include:

flink-end-to-end-test-part1: split_checkpoints.sh and split_sticky.sh
flink-end-to-end-test-part2: split_heavy.sh and split_ha.sh
flink-end-to-end-test-part3: split_misc.sh and split_misc_hadoopfree.sh

part1 and part2 run well. part3 is not stable yet; I need more time to fix it. The container part is not included because of problem 5 below (I'll add it once that is solved).

I had to apply some hacks to make the jobs pass. They are:

1. Frocksdb ARM package: https://issues.apache.org/jira/browse/FLINK-13598 (not solved)
2. PrometheusReporterEndToEndITCase doesn't support the ARM arch: https://issues.apache.org/jira/browse/FLINK-14086 (PR for fix: https://github.com/apache/flink/pull/9768)
3. Elasticsearch Xpack Machine Learning doesn't support ARM: https://issues.apache.org/jira/browse/FLINK-14126 (PR for fix: https://github.com/apache/flink/pull/9765)
4. maven-shade-plugin 3.2.1 doesn't work on ARM for Flink (fixed, thanks @Dian Fu)
5. The Flink e2e container test doesn't support ARM: https://issues.apache.org/jira/browse/FLINK-14241 (PR for fix: https://github.com/apache/flink/pull/9782)

Please help review these PRs. Thanks very much.

I also opened a PR [2] to make Flink run the cron jobs officially. Once it's merged, the jobs will run once a day at 20:00 UTC. The results can be sent to [hidden email] if the Flink team can give permission for mail to be sent from [hidden email].

[1]: http://status.openlabtesting.org/builds?project=apache%2Fflink
[2]: https://github.com/apache/flink/pull/9416

Thanks.

Xiyuan Wang <[hidden email]> wrote on Wed, Sep 25, 2019 at 5:33 PM:

> Hi Till,
>
> Thanks for your response. All ARM-related work is tracked under
> https://issues.apache.org/jira/browse/FLINK-13448, and I have already
> created some PRs.
>
> After some local hacking, the E2E tests now run well, and I have already
> added them to OpenLab. The POC log:
> http://status.openlabtesting.org/builds?project=apache%2Fflink&pipeline=periodic-20
> The job runs at 20:00 UTC every day. Following the POC, I have also
> created the official PR for the cron job, which covers the
> core/test-related module tests and the e2e tests (except the container
> ones): https://github.com/apache/flink/pull/9416
>
> Once it's merged, I can configure OpenLab to send the test results to
> [hidden email] every day.
>
> Thanks.
|
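To illustrate the kind of fix item 2 above (FLINK-14086) calls for, here is a minimal Python sketch that picks a Prometheus release archive matching the host architecture instead of always using the x86 build. This is not the actual change in the Flink PR (the real test is a Java class); the helper name `prometheus_archive_name` and the version string used in the usage note are assumptions for illustration. The archive naming pattern `prometheus-<version>.linux-<arch>.tar.gz` follows the upstream Prometheus release files.

```python
import platform

# Map values reported by platform.machine() to the architecture suffix used
# in Prometheus release archive names. Only the two architectures discussed
# in this thread are covered; anything else is rejected explicitly.
_ARCH_SUFFIX = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "aarch64": "arm64",
    "arm64": "arm64",
}


def prometheus_archive_name(version, machine=None):
    """Return the Linux release archive name for the given architecture.

    `machine` defaults to the host's platform.machine() value.
    (Hypothetical helper, for illustration only.)
    """
    machine = (machine or platform.machine()).lower()
    try:
        arch = _ARCH_SUFFIX[machine]
    except KeyError:
        raise ValueError(f"unsupported architecture: {machine}") from None
    return f"prometheus-{version}.linux-{arch}.tar.gz"
```

On an ARM64 host, `prometheus_archive_name("2.4.3")` would select the `linux-arm64` archive, while on an x86_64 host the same call selects `linux-amd64`; the version here is only an example value.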