Hi devs!
I wanted to bring up something that was discussed in a few independent groups of people over the past days. I'd like to revisit the use of timeouts in our JUnit tests. The suggestion would be to not use them anymore. The problem with timeouts is that we get no thread dump and no stack traces of the system as it hangs. If we were not using a timeout, the CI runner would have caught the hang and created a thread dump, which is often a great starting point for debugging.

This problem was spotted e.g. while debugging FLINK-22416 [1]. In the past, thread dumps were not always taken for hanging tests, but that changed quite recently with FLINK-21346 [2]. I am happy to hear your opinions on this. If there are no objections, I would like to add the suggestion to the Coding Guidelines [3].

Best,

Dawid

[1] https://issues.apache.org/jira/browse/FLINK-22416
[2] https://issues.apache.org/jira/browse/FLINK-21346
[3] https://flink.apache.org/contributing/code-style-and-quality-java.html#java-language-features-and-libraries
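For readers who want to see the pattern in question, here is a minimal JUnit 4 sketch contrasting the two approaches (the class, method, and helper names are made up for illustration):

import org.junit.Test;

public class ExampleCoordinatorTest {

    // The pattern this proposal discourages: the test is killed after
    // 10 seconds, so a hang leaves no thread dump behind.
    @Test(timeout = 10_000)
    public void testWithPerTestTimeout() throws Exception {
        runScenario();
    }

    // The suggested alternative: no per-test timeout. If the test hangs,
    // the CI watchdog eventually times out and captures a thread dump
    // of the whole JVM (see FLINK-21346).
    @Test
    public void testWithoutPerTestTimeout() throws Exception {
        runScenario();
    }

    private void runScenario() throws Exception {
        // the actual test logic would go here
    }
}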
+1 from my side.
We should probably double-check whether we really need 4h timeouts on the test tasks in AZP. It feels like 2h should be enough.
+1. I think this rule makes a lot of sense.
Cheers,
Till
I was actually recently wondering if we shouldn't rather use timeouts more
aggressively in JUnit. There was recently a case where a number of tests accidentally ran for 5 minutes, because a timeout had been increased to 5 minutes. If we had a global limit of 1 minute per test, we would have caught this case (and we would encourage people to be careful with CI time).

If we are going to add some custom timeout infrastructure to JUnit (in Java, not in the CI bash scripts ;) ), it should be fairly straightforward to print the current stack traces in case of a timeout. Another benefit of solving this problem at the JUnit level is that the behavior would be the same in all environments (for example when running the tests locally). The final benefit would be that we would get a list of all tests that are timing out (from that module), instead of having the build stall at a random test.
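Such infrastructure does not exist yet; as a rough sketch of the idea, a JUnit 4 rule could wrap the built-in Timeout rule and dump all thread stacks before the timeout surfaces (the class name is hypothetical):

import java.util.Map;
import java.util.concurrent.TimeUnit;
import org.junit.rules.Timeout;
import org.junit.runner.Description;
import org.junit.runners.model.Statement;
import org.junit.runners.model.TestTimedOutException;

// Hypothetical rule: behaves like JUnit's Timeout, but prints a dump of
// all thread stacks before rethrowing, so a hang leaves a trace in the
// test output in every environment, local or CI.
public class StackDumpingTimeout extends Timeout {

    public StackDumpingTimeout(long timeout, TimeUnit unit) {
        super(timeout, unit);
    }

    @Override
    public Statement apply(Statement base, Description description) {
        Statement withTimeout = super.apply(base, description);
        return new Statement() {
            @Override
            public void evaluate() throws Throwable {
                try {
                    withTimeout.evaluate();
                } catch (TestTimedOutException e) {
                    dumpAllStacks(description.getDisplayName());
                    throw e;
                }
            }
        };
    }

    private static void dumpAllStacks(String testName) {
        System.err.println("Test timed out: " + testName);
        for (Map.Entry<Thread, StackTraceElement[]> entry
                : Thread.getAllStackTraces().entrySet()) {
            System.err.println("Thread \"" + entry.getKey().getName() + "\":");
            for (StackTraceElement frame : entry.getValue()) {
                System.err.println("    at " + frame);
            }
        }
    }
}

A test class would then declare it as, e.g., @Rule public final StackDumpingTimeout timeout = new StackDumpingTimeout(1, TimeUnit.MINUTES);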
Assuming that not many tests deadlock I think it should be fine to simply
let the build process deadlock. Even if multiple tests deadlock consistently, one would see them one after another. That way we wouldn't have to build any extra tooling. Moreover, the behaviour would be consistent on a local machine, because the same test would also deadlock there.

Decreasing the build/test time is, in my opinion, more relevant for tests which actually do pass but do things too slowly, and is hence more of an orthogonal discussion.

Cheers,
Till
Just to make sure I understand the proposal correctly: is the proposal to
disallow the usage of @Test(timeout=...) for Flink JUnit tests?

Here is my understanding of the pros and cons according to the discussion so far.

Pros of allowing timeouts:
1) When there are tests that are unreasonably slow, it helps us catch those tests and thus increases the quality of our unit tests.
2) When there are tests that deadlock, it helps the AZP job fail fast instead of being blocked for 4 hours. This saves resources and also allows developers to get their PR tested again earlier (useful when the test failure is not related to their PR).

Cons of allowing timeouts:
1) When there are tests that deadlock, we cannot see the thread dump of all threads, which makes debugging the issue harder.

I would suggest that we still allow timeouts, because the pros outweigh the cons.

As far as I can tell, if we allow timeouts and encounter a deadlock bug on AZP, we still know which test (or test suite) failed. There is a good chance we can reproduce the deadlock locally (by running the test 100 times) and get the debug information we need. In the rare case where the deadlock happens only on AZP, we can just disable the timeout for that particular test. So the lack of a thread dump is not really a concern.

On the other hand, if we disallow timeouts, it will be very hard for us to catch low-quality tests. I don't know of a good alternative way to catch them.
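As a small illustration of the "run it 100 times" approach, a suspect test class can be driven in a loop with JUnitCore; the SuspectTest class below is a stand-in for whichever test is flaky:

import org.junit.Test;
import org.junit.runner.JUnitCore;
import org.junit.runner.Result;
import org.junit.runner.notification.Failure;

// Runs a suspect test class repeatedly to try to reproduce a rare
// deadlock or failure locally.
public class RepeatRunner {

    public static class SuspectTest {
        @Test
        public void sometimesHangs() {
            // the flaky logic under investigation would go here
        }
    }

    public static void main(String[] args) {
        for (int i = 1; i <= 100; i++) {
            Result result = JUnitCore.runClasses(SuspectTest.class);
            if (!result.wasSuccessful()) {
                System.err.println("Reproduced on iteration " + i);
                for (Failure failure : result.getFailures()) {
                    System.err.println(failure.getTrace());
                }
                return;
            }
        }
        System.out.println("100 green runs, no reproduction");
    }
}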
There is one more point that may be useful to consider here.
In order to debug a deadlock that is not easily reproducible, the thread dump alone is likely not sufficient to figure out the root cause. We likely also need INFO-level logging. Since AZP does not provide INFO-level logging by default, we either need to reproduce the bug locally or temporarily change the AZP log4j configuration. This further reduces the benefit of logging the thread dump (which comes at the cost of letting the AZP job hang).
I think we do capture the INFO logs of the test runs on AZP.
I am also not sure whether we have really caught slow tests with JUnit's timeout rule before. I think the usual reaction is to increase the timeout until the test passes. One way to find slow tests is to measure their runtime and look at the outliers.

Cheers,
Till
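JUnit 4 ships a Stopwatch rule that makes such measurements easy to collect; a minimal sketch (the class name and the one-second threshold are arbitrary examples):

import java.util.concurrent.TimeUnit;
import org.junit.Rule;
import org.junit.Test;
import org.junit.rules.Stopwatch;
import org.junit.runner.Description;

public class SlowTestReportingExample {

    // Reports any test in this class that runs longer than the threshold,
    // so outliers show up in the build output without failing the test.
    @Rule
    public final Stopwatch stopwatch = new Stopwatch() {
        @Override
        protected void finished(long nanos, Description description) {
            long millis = TimeUnit.NANOSECONDS.toMillis(nanos);
            if (millis > 1_000) { // arbitrary threshold
                System.err.printf("SLOW TEST: %s took %d ms%n",
                        description.getDisplayName(), millis);
            }
        }
    };

    @Test
    public void example() {
        // regular test logic
    }
}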
Just to add to Dong Lin's list of cons of allowing timeout:
- Any timeout value that you set manually is arbitrary. If it's set too low, you get test instabilities. What "too low" means depends on numerous factors, such as the hardware and its current utilization (especially I/O). If you run in VMs and the VM is migrated while running a test, any reasonable timeout will probably fail. While you could make a similar case for the overall timeout of the test run, any smaller hiccup in the range of minutes will not impact the overall runtime much, and the probability of a VM constantly migrating during the same stage is abysmally low.
- A timeout is more maintenance-intensive. It's one more knob that can make or break a build. If you change the test a bit, you also need to double-check the timeout. Hence, there have been quite a few commits that just increase timeouts.
- Whether a test uses a timeout or not is arbitrary: why do some ITs have a timeout and others don't? All IT cases are prone to time out if there are issues with resource allocation. Similarly, quite a few unit tests have timeouts while others don't, with no obvious pattern.
- An ill-set timeout reduces build reproducibility. Imagine a release with such a timeout where users cannot build Flink reliably.

I'd also like to point out that we should not cater to unstable tests if our overall goal is to have as many green builds as possible. If we assumed that our builds fail more often than not, we should also look in the other direction and continue builds on error. I'm not a big fan of that.

One argument that I also heard is that timeouts ease local debugging in case of refactorings, as you can see multiple failures at the same time. But no one is keeping you from temporarily adding a timeout on your branch (see the sketch at the end of this message). Then we can be sure that the timeout is plausible for your hardware, and we avoid all of the drawbacks mentioned above.

@Robert Metzger

> If we had a global limit of 1 minute per test, we would have caught this
> case (and we would encourage people to be careful with CI time).

There are quite a few tests that run longer, especially on a well-utilized build machine. A global limit is even worse than individual limits, as there is no value that fits all tests. If you screwed up and 200 tests hang, you'd run into the global CI timeout anyway. I'm also not sure what the additional hangs would bring you, except a huge log.

I'm also not sure it's really better in terms of CI time. For example, in UnalignedCheckpointRescaleITCase we test all known partitioners in one pipeline for correctness. For higher parallelisms, that means the test regularly runs over 1 minute. With a global limit, I'd need to split the test into smaller chunks, and I'm positive that the sum of the chunks would take longer than before.

PS: all tests on AZP print INFO logs to the artifacts. There you can also retrieve the stack traces.
PPS: I also said that we should revalidate the current overall timeout on AZP. So the argument that more than 2h of precious CI time is wasted is somewhat constructed and just due to some arbitrary defaults.
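A minimal sketch of such a temporary, branch-local timeout, using JUnit 4's built-in Timeout rule (the class name and the three-minute value are just examples):

import java.util.concurrent.TimeUnit;
import org.junit.Rule;
import org.junit.Test;
import org.junit.rules.Timeout;

public class MyRefactoredComponentTest {

    // Branch-only aid while refactoring: fail any test in this class after
    // 3 minutes, so one deadlocking test doesn't block the whole local run.
    // Remove this rule again before opening the PR.
    @Rule
    public final Timeout perTestTimeout = new Timeout(3, TimeUnit.MINUTES);

    @Test
    public void refactoredBehavior() {
        // test logic
    }
}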
Thanks for the detailed explanations! Regarding the usage of timeout, now I
agree that it is better to remove per-test timeouts, because it helps make our testing results more reliable and consistent.

My previous concern was that it might not be a good idea to intentionally let a test hang on AZP in order to get the thread dump. Now I see that there are a few practical concerns around the usage of timeouts which make testing results unreliable (e.g. flakiness in the presence of VM migration).

Regarding the logging level on AZP: it appears that we actually set "rootLogger.level = OFF" in most log4j2-test.properties files, which means that no INFO logs would be printed on AZP. For example, I tried to increase the log level in this PR <https://github.com/apache/flink/pull/15617> and was advised in this comment <https://issues.apache.org/jira/browse/FLINK-22085?focusedCommentId=17321055&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17321055> to avoid increasing the log level. Did I miss something here?
I think for the maven tests we use this log4j.properties file [1].
[1] https://github.com/apache/flink/blob/master/tools/ci/log4j.properties

Cheers,
Till
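For orientation, a CI-side log4j2 configuration of this kind typically enables INFO on the root logger and routes it to a file appender rather than the console; a rough sketch (this is not the verbatim content of [1], and the file name and pattern are assumptions):

# Enable INFO logging and send it to a file appender instead of the console.
rootLogger.level = INFO
rootLogger.appenderRef.file.ref = FileAppender

# Write the INFO logs to a file that is uploaded as a build artifact,
# keeping the console output small.
appender.file.name = FileAppender
appender.file.type = File
appender.file.fileName = ${sys:log.dir}/mvn-test.log
appender.file.layout.type = PatternLayout
appender.file.layout.pattern = %d{HH:mm:ss,SSS} %-5p %-60c %x - %m%n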
Thanks Till. Yes you are right. The INFO logging is enabled. It is just
dumped to a file (via the FileAppender) rather than printed to the console.

There is probably a way to retrieve that log file from AZP. I will ask other colleagues how to get it later.
Yes, you can click on the test job where a test failed. Then you can click
on 1 artifact produced (or on the overview page on the X published (artifacts)). This brings you to the published artifacts page where we upload for every job the logs. Cheers, Till On Fri, Apr 30, 2021 at 9:22 AM Dong Lin <[hidden email]> wrote: > Thanks Till. Yes you are right. The INFO logging is enabled. It is just > dumped to a file (the FileAppender) other than the console. > > There is probably a way to retrieve that log file from AZP. I will ask > other colleagues how to get this later. > > On Thu, Apr 29, 2021 at 4:51 PM Till Rohrmann <[hidden email]> > wrote: > > > I think for the maven tests we use this log4j.properties file [1]. > > > > [1] > https://github.com/apache/flink/blob/master/tools/ci/log4j.properties > > > > Cheers, > > Till > > > > On Wed, Apr 28, 2021 at 4:47 AM Dong Lin <[hidden email]> wrote: > > > > > Thanks for the detailed explanations! Regarding the usage of timeout, > > now I > > > agree that it is better to remove per-test timeouts because it helps > > > make our testing results more reliable and consistent. > > > > > > My previous concern is that it might not be a good idea to > intentionally > > > let the test hang in AZP in order to get the thread dump. Now I get > that > > > there are a few practical concerns around the usage of timeout which > > makes > > > testing results unreliable (e.g. flakiness in the presence of VM > > > migration). > > > > > > Regarding the level logging on AZP, it appears that we actually set > > > "rootLogger.level = OFF" in most log4j2-test.properties, which means > that > > > no INFO log would be printed on AZP. For example, I tried to increase > the > > > log level in this <https://github.com/apache/flink/pull/15617> PR and > > was > > > suggested in this > > > < > > > > > > https://issues.apache.org/jira/browse/FLINK-22085?focusedCommentId=17321055&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17321055 > > > > > > > comment to avoid increasing the log level. Did I miss something here? > > > > > > > > > On Wed, Apr 28, 2021 at 2:22 AM Arvid Heise <[hidden email]> wrote: > > > > > > > Just to add to Dong Lin's list of cons of allowing timeout: > > > > - Any timeout value that you manually set is arbitrary. If it's set > too > > > > low, you get test instabilities. What too low means depends on > numerous > > > > factors, such as hardware and current utilization (especially I/O). > If > > > you > > > > run in VMs and the VM is migrated while running a test, any > reasonable > > > > timeout will probably fail. While you could make a similar case for > the > > > > overall timeout of tests, any smaller hiccup in the range of minutes > > will > > > > not impact the overall runtime much. The probability of having a VM > > > > constantly migrating during the same stage is abysmally low. > > > > - A timeout is more maintenance-intensive. It's one more knob where > you > > > can > > > > tweak a build or not. If you change the test a bit, you also need to > > > > double-check the timeout. Hence, there have been quite a few commits > > that > > > > just increase timeouts. > > > > - Whether a test uses a timeout or not is arbitrary: Why do some ITs > > > have a > > > > timeout and others don't? All IT tests are prone to timeout if there > > are > > > > issues with resource allocation. Similarly, there are quite a few > unit > > > > tests with timeouts while others don't have them with no obvious > > pattern. > > > > - An ill-set timeout reduces build reproducibility. 
Hi,
I'm OK with removing most or even all timeouts. Just one thing: the reason behind using JUnit timeouts that I've heard (and was adhering to) is that the Maven watchdog took the thread dump and killed the test process based on a log-inactivity timeout. Some tests were by nature prone to live lock (for example, if a job had an infinite source), which prevented that watchdog from kicking in. And if I remember correctly, we didn't have a global timeout for all tests; Travis just killed our jobs without even uploading any logs to S3. So JUnit-level timeouts were very valuable in those kinds of cases, to at least get some logs, even if without a stack trace.

I have no idea what the state of our watchdogs is right now, after all of the changes in the past years, so I don't know how relevant this reasoning still is.

Piotrek
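To make the discussion concrete, the JUnit 4 timeout mechanisms in question look roughly like the sketch below; the class and test names are made up. Both variants fail the test from inside the running JVM, which is why neither leaves a dump of all threads in the hung state:

    import org.junit.Rule;
    import org.junit.Test;
    import org.junit.rules.Timeout;

    public class ExampleOperatorTest {

        // Class-level rule: applies to every test method in this class.
        @Rule
        public final Timeout perTestTimeout = Timeout.seconds(120);

        // Method-level timeout: JUnit fails the test after 30s and reports
        // the test thread's stack trace, but there is no dump of all the
        // other threads that may be involved in the hang.
        @Test(timeout = 30_000)
        public void testRecordsAreForwarded() throws Exception {
            // test body elided
        }
    }
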
Hey all,
Sorry I've not been active in the thread. I think the conclusion is rather positive: we do want to depend more on the Azure watchdog and discourage timeouts in JUnit tests from now on. I'll update our coding guidelines accordingly.

@Piotr Yes, this was a problem in the past, but it was solved quite recently in FLINK-21346[1], which I linked. Thread dumps will be taken and logs uploaded if a job reaches the global limit, similarly to how it happens for the e2e tests.

Best,

Dawid

[1] https://issues.apache.org/jira/browse/FLINK-21346
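In terms of the guideline itself, the change mostly amounts to dropping the timeout parameter and letting the CI watchdog handle hangs. A minimal sketch of the preferred style, with a made-up test as a stand-in for real logic:

    import static org.junit.Assert.assertTrue;

    import org.junit.Test;

    public class GuidelineExampleTest {

        // Discouraged from now on:
        //   @Test(timeout = 60_000)
        // If the test hangs, JUnit kills it without a full thread dump.

        // Preferred: no JUnit-level timeout. If the test hangs, the CI
        // watchdog reaches the global limit, takes thread dumps of the
        // stuck JVM, and uploads them with the logs (FLINK-21346).
        @Test
        public void testCheckpointCompletes() {
            assertTrue(completeCheckpoint());
        }

        private boolean completeCheckpoint() {
            return true; // stand-in for the real test logic
        }
    }
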
Thanks for the response, Dawid. In this case I don't see a reason for timeouts in JUnit. I agree with the previously mentioned points, and I think it doesn't make sense to use them to limit test duration either.

Piotrek
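As was pointed out earlier in the thread, a timeout can still be handy as a temporary, local-only debugging aid: added on your own branch, tuned to your hardware, and removed before opening the PR. A hedged sketch, with an arbitrary rule name and value:

    import org.junit.Rule;
    import org.junit.Test;
    import org.junit.rules.Timeout;

    public class LocalDebugExampleTest {

        // Local branch only: fail fast on this machine while bisecting
        // a hang; revert this rule before opening the PR.
        @Rule
        public final Timeout localDebugTimeout = Timeout.seconds(60);

        @Test
        public void testUnderInvestigation() throws Exception {
            // test body elided
        }
    }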