Hello everybody,
in the past few days my colleagues and I ran some tests with Flink on YARN and detected a possibly inconsistent behavior in the way the contents of the flink/lib directory are shipped to the cluster, depending on whether the jobs are deployed individually or onto a long-running session.

After some discussion on the user mailing list <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-and-YARN-ship-folder-td5458.html> we were under the impression that the contents of that folder are always supposed to be copied so that all the nodes have access to them. Furthermore, we've found a comment in the code <https://github.com/apache/flink/blob/master/flink-yarn/src/main/java/org/apache/flink/yarn/FlinkYarnClientBase.java#L254-L263> that states:

// remove uberjar from ship list (by default everything in the lib/ folder is added to
// the list of files to ship, but we handle the uberjar separately.

However, after having a look at some portions of the code, I'm not really sure whether this is actually the case. The Flink long-running YARN session does ship the contents, because this is specified in the yarn-session.sh script <https://github.com/apache/flink/blob/master/flink-dist/src/main/flink-bin/yarn-bin/yarn-session.sh#L55>; running a single job on YARN, however, does not automatically ship the contents of the lib folder.

The behavior is not documented, and I'd like to write some lines in the docs to make clear what is shipped in which case. Also, if there is agreement on the behavior that single jobs on YARN should have, I can provide a fix for it. My feeling is that running a job on YARN should end up having more or less the same effect, regardless of the way the job is run.

Let me know what you think, thank you for your attention.

--
BR,
Stefano Baghino
Software Engineer @ Radicalbit
On Tue, Mar 22, 2016 at 8:42 PM, Stefano Baghino <[hidden email]> wrote:
> My feeling is that running a job on YARN should
> end up in having more or less the same effect, regardless of the way the
> job is run.

+1

I think that the current behaviour is buggy. The resource management is currently undergoing a massive refactoring (https://github.com/apache/flink/pull/1741). Maybe it's already fixed there (if the issue is independent of the scripts).

Would be great to have a fix for this. If #1741 does not fix it, feel free to open an issue and PR. :-)

– Ufuk
Thanks for pointing out Max's work (awesome PR, btw). It actually seems to have introduced an environment variable regarding ship directories; it would be good to have his feedback on this.

On Tue, Mar 22, 2016 at 10:24 PM, Ufuk Celebi <[hidden email]> wrote:
> On Tue, Mar 22, 2016 at 8:42 PM, Stefano Baghino <[hidden email]> wrote:
> > My feeling is that running a job on YARN should
> > end up in having more or less the same effect, regardless of the way the
> > job is run.
>
> +1
>
> I think that the current behaviour is buggy. The resource management
> is currently undergoing a massive refactoring
> (https://github.com/apache/flink/pull/1741). Maybe it's already fixed
> there (if the issue is independent of the scripts).
>
> Would be great to have a fix for this. If #1741 does not fix it, feel
> free to open an issue and PR. :-)
>
> – Ufuk

--
BR,
Stefano Baghino
Software Engineer @ Radicalbit
Hi Stefano,

Thanks for pointing out this bug. Your analysis is correct. The per-job cluster does not ship the /lib directory by default. Would you like to open an issue/PR? We should let the ship_path default to the /lib directory.

The mechanism with the environment variables is the same. They used to be defined in a different location (FlinkYarnClient) and have been moved to a separate class (YarnConfigKeys).

Cheers,
Max

On Wed, Mar 23, 2016 at 10:06 AM, Stefano Baghino <[hidden email]> wrote:
> Thanks for pointing out Max's work (awesome PR, btw). It actually seems to
> have introduced an environment variable regarding ship directories, it
> would be good to have his feedback on this.
Yup, I shall open an issue for both this one and my other thread (re: Kerberos). Thanks for the pointer on this issue.

On Tue, Mar 29, 2016 at 12:44 PM, Maximilian Michels <[hidden email]> wrote:
> Hi Stefano,
>
> Thanks for pointing out this bug. Your analysis is correct. The per-job
> cluster does not ship the /lib directory by default. Would you like to open
> an issue/PR? We should let the ship_path default to the /lib directory.
>
> The mechanism with the environment variables is the same. They used to be
> defined in a different location (FlinkYarnClient) and have been moved to a
> separate class (YarnConfigKeys).
>
> Cheers,
> Max

--
BR,
Stefano Baghino
Software Engineer @ Radicalbit
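
For readers following the thread, the behavior Max proposes (let the ship path default to the lib directory, while the uberjar is still handled separately, as the quoted code comment describes) could look roughly like the sketch below. It is only an illustration under assumptions: `ShipListSketch` and `resolveShipFiles` are hypothetical names and not part of Flink's YARN client, and an actual fix would belong wherever the client assembles its ship list.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper for illustration; not Flink's actual implementation. */
final class ShipListSketch {

    private ShipListSketch() {}

    /**
     * Resolves the files to ship to the YARN cluster.
     *
     * If the user named no ship files, the contents of libDir are used as
     * the default. The uberjar is always removed from the result, on the
     * assumption that it is uploaded separately.
     */
    static List<File> resolveShipFiles(List<File> userShipFiles, File libDir, File uberJar) {
        List<File> shipFiles = new ArrayList<>(userShipFiles);

        if (shipFiles.isEmpty()) {
            // listFiles() returns null when libDir is missing or not a directory.
            File[] libContents = libDir.listFiles();
            if (libContents != null) {
                shipFiles.addAll(Arrays.asList(libContents));
            }
        }

        // Drop the uberjar by file name so it is not shipped twice.
        shipFiles.removeIf(f -> f.getName().equals(uberJar.getName()));

        // Sort for a deterministic order; File compares by path.
        Collections.sort(shipFiles);
        return shipFiles;
    }
}
```

With a rule like this applied in both code paths, a per-job deployment and a long-running session would end up shipping the same files when the user specifies nothing, which is the consistency asked for at the start of the thread.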