Hi Peter and Kostas,
Thanks for creating this FLIP. Moving the JobGraph compilation to the cluster makes a lot of sense to me. FLIP-40 had the exactly same idea, but is currently dormant and can probably be superseded by this FLIP. After reading the FLIP, I still have a few questions. 1. What exactly the job submission interface will look like after this FLIP? The FLIP template has a Public Interface section but was removed from this FLIP. 2. How will the new ClusterEntrypoint fetch the jars from external storage? What external storage will be supported out of the box? Will this "jar fetcher" be pluggable? If so, how does the API look like and how will users specify the custom "jar fetcher"? 3. It sounds that in this FLIP, the "session cluster" running the application has the same lifecycle as the user application. How will the session cluster be teared down after the application finishes? Will the ClusterEntrypoint do that? Will there be an option of not tearing the cluster down? Maybe they have been discussed in the ML earlier, but I think they should be part of the FLIP also. Thanks, Jiangjie (Becket) Qin On Thu, Mar 5, 2020 at 10:09 PM Kostas Kloudas <[hidden email]> wrote: > Also from my side +1 to start voting. > > Cheers, > Kostas > > On Thu, Mar 5, 2020 at 7:45 AM tison <[hidden email]> wrote: > > > > +1 to star voting. > > > > Best, > > tison. > > > > > > Yang Wang <[hidden email]> 于2020年3月5日周四 下午2:29写道: > >> > >> Hi Peter, > >> Really thanks for your response. > >> > >> Hi all @Kostas Kloudas @Zili Chen @Peter Huang @Rong Rong > >> It seems that we have reached an agreement. The “application mode” is > regarded as the enhanced “per-job”. It is > >> orthogonal with “cluster deploy”. Currently, we bind the “per-job” to > `run-user-main-on-client` and “application mode” > >> to `run-user-main-on-cluster`. > >> > >> Do you have other concerns to moving FLIP-85 to voting? > >> > >> > >> Best, > >> Yang > >> > >> Peter Huang <[hidden email]> 于2020年3月5日周四 下午12:48写道: > >>> > >>> Hi Yang and Kostas, > >>> > >>> Thanks for the clarification. It makes more sense to me if the long > term goal is to replace per job mode to application mode > >>> in the future (at the time that multiple execute can be supported). > Before that, It will be better to keep the concept of > >>> application mode internally. As Yang suggested, User only need to use > a `-R/-- remote-deploy` cli option to launch > >>> a per job cluster with the main function executed in cluster > entry-point. +1 for the execution plan. > >>> > >>> > >>> > >>> Best Regards > >>> Peter Huang > >>> > >>> > >>> > >>> > >>> On Tue, Mar 3, 2020 at 7:11 AM Yang Wang <[hidden email]> > wrote: > >>>> > >>>> Hi Peter, > >>>> > >>>> Having the application mode does not mean we will drop the > cluster-deploy > >>>> option. I just want to share some thoughts about “Application Mode”. > >>>> > >>>> > >>>> 1. The application mode could cover the per-job sematic. Its lifecyle > is bound > >>>> to the user `main()`. And all the jobs in the user main will be > executed in a same > >>>> Flink cluster. In first phase of FLIP-85 implementation, running user > main on the > >>>> cluster side could be supported in application mode. > >>>> > >>>> 2. Maybe in the future, we also need to support multiple `execute()` > on client side > >>>> in a same Flink cluster. Then the per-job mode will evolve to > application mode. > >>>> > >>>> 3. From user perspective, only a `-R/-- remote-deploy` cli option is > visible. They > >>>> are not aware of the application mode. > >>>> > >>>> 4. 
In the first phase, the application mode is working as > “per-job”(only one job in > >>>> the user main). We just leave more potential for the future. > >>>> > >>>> > >>>> I am not against with calling it “cluster deploy mode” if you all > think it is clearer for users. > >>>> > >>>> > >>>> > >>>> Best, > >>>> Yang > >>>> > >>>> Kostas Kloudas <[hidden email]> 于2020年3月3日周二 下午6:49写道: > >>>>> > >>>>> Hi Peter, > >>>>> > >>>>> I understand your point. This is why I was also a bit torn about the > >>>>> name and my proposal was a bit aligned with yours (something along > the > >>>>> lines of "cluster deploy" mode). > >>>>> > >>>>> But many of the other participants in the discussion suggested the > >>>>> "Application Mode". I think that the reasoning is that now the user's > >>>>> Application is more self-contained. > >>>>> It will be submitted to the cluster and the user can just disconnect. > >>>>> In addition, as discussed briefly in the doc, in the future there may > >>>>> be better support for multi-execute applications which will bring us > >>>>> one step closer to the true "Application Mode". But this is how I > >>>>> interpreted their arguments, of course they can also express their > >>>>> thoughts on the topic :) > >>>>> > >>>>> Cheers, > >>>>> Kostas > >>>>> > >>>>> On Mon, Mar 2, 2020 at 6:15 PM Peter Huang < > [hidden email]> wrote: > >>>>> > > >>>>> > Hi Kostas, > >>>>> > > >>>>> > Thanks for updating the wiki. We have aligned with the > implementations in the doc. But I feel it is still a little bit confusing > of the naming from a user's perspective. It is well known that Flink > support per job cluster and session cluster. The concept is in the layer of > how a job is managed within Flink. The method introduced util now is a kind > of mixing job and session cluster to promising the implementation > complexity. We probably don't need to label it as Application Model as the > same layer of per job cluster and session cluster. Conceptually, I think it > is still a cluster mode implementation for per job cluster. > >>>>> > > >>>>> > To minimize the confusion of users, I think it would be better > just an option of per job cluster for each type of cluster manager. How do > you think? > >>>>> > > >>>>> > > >>>>> > Best Regards > >>>>> > Peter Huang > >>>>> > > >>>>> > > >>>>> > > >>>>> > > >>>>> > > >>>>> > > >>>>> > > >>>>> > > >>>>> > On Mon, Mar 2, 2020 at 7:22 AM Kostas Kloudas <[hidden email]> > wrote: > >>>>> >> > >>>>> >> Hi Yang, > >>>>> >> > >>>>> >> The difference between per-job and application mode is that, as > you > >>>>> >> described, in the per-job mode the main is executed on the client > >>>>> >> while in the application mode, the main is executed on the > cluster. > >>>>> >> I do not think we have to offer "application mode" with running > the > >>>>> >> main on the client side as this is exactly what the per-job mode > does > >>>>> >> currently and, as you described also, it would be redundant. > >>>>> >> > >>>>> >> Sorry if this was not clear in the document. > >>>>> >> > >>>>> >> Cheers, > >>>>> >> Kostas > >>>>> >> > >>>>> >> On Mon, Mar 2, 2020 at 3:17 PM Yang Wang <[hidden email]> > wrote: > >>>>> >> > > >>>>> >> > Hi Kostas, > >>>>> >> > > >>>>> >> > Thanks a lot for your conclusion and updating the FLIP-85 WIKI. > Currently, i have no more > >>>>> >> > questions about motivation, approach, fault tolerance and the > first phase implementation. > >>>>> >> > > >>>>> >> > I think the new title "Flink Application Mode" makes a lot > senses to me. 
Especially for the > >>>>> >> > containerized environment, the cluster deploy option will be > very useful. > >>>>> >> > > >>>>> >> > Just one concern, how do we introduce this new application mode > to our users? > >>>>> >> > Each user program(i.e. `main()`) is an application. Currently, > we intend to only support one > >>>>> >> > `execute()`. So what's the difference between per-job and > application mode? > >>>>> >> > > >>>>> >> > For per-job, user `main()` is always executed on client side. > And For application mode, user > >>>>> >> > `main()` could be executed on client or master side(configured > via cli option). > >>>>> >> > Right? We need to have a clear concept. Otherwise, the users > will be more and more confusing. > >>>>> >> > > >>>>> >> > > >>>>> >> > Best, > >>>>> >> > Yang > >>>>> >> > > >>>>> >> > Kostas Kloudas <[hidden email]> 于2020年3月2日周一 下午5:58写道: > >>>>> >> >> > >>>>> >> >> Hi all, > >>>>> >> >> > >>>>> >> >> I update > https://cwiki.apache.org/confluence/display/FLINK/FLIP-85+Flink+Application+Mode > >>>>> >> >> based on the discussion we had here: > >>>>> >> >> > >>>>> >> >> > https://docs.google.com/document/d/1ji72s3FD9DYUyGuKnJoO4ApzV-nSsZa0-bceGXW7Ocw/edit# > >>>>> >> >> > >>>>> >> >> Please let me know what you think and please keep the > discussion in the ML :) > >>>>> >> >> > >>>>> >> >> Thanks for starting the discussion and I hope that soon we > will be > >>>>> >> >> able to vote on the FLIP. > >>>>> >> >> > >>>>> >> >> Cheers, > >>>>> >> >> Kostas > >>>>> >> >> > >>>>> >> >> On Thu, Jan 16, 2020 at 3:40 AM Yang Wang < > [hidden email]> wrote: > >>>>> >> >> > > >>>>> >> >> > Hi all, > >>>>> >> >> > > >>>>> >> >> > Thanks a lot for the feedback from @Kostas Kloudas. Your all > concerns are > >>>>> >> >> > on point. The FLIP-85 is mainly > >>>>> >> >> > focused on supporting cluster mode for per-job. Since it is > more urgent and > >>>>> >> >> > have much more use > >>>>> >> >> > cases both in Yarn and Kubernetes deployment. For session > cluster, we could > >>>>> >> >> > have more discussion > >>>>> >> >> > in a new thread later. > >>>>> >> >> > > >>>>> >> >> > #1, How to download the user jars and dependencies for > per-job in cluster > >>>>> >> >> > mode? > >>>>> >> >> > For Yarn, we could register the user jars and dependencies as > >>>>> >> >> > LocalResource. They will be distributed > >>>>> >> >> > by Yarn. And once the JobManager and TaskManager launched, > the jars are > >>>>> >> >> > already exists. > >>>>> >> >> > For Standalone per-job and K8s, we expect that the user jars > >>>>> >> >> > and dependencies are built into the image. > >>>>> >> >> > Or the InitContainer could be used for downloading. It is > natively > >>>>> >> >> > distributed and we will not have bottleneck. > >>>>> >> >> > > >>>>> >> >> > #2, Job graph recovery > >>>>> >> >> > We could have an optimization to store job graph on the DFS. > However, i > >>>>> >> >> > suggest building a new jobgraph > >>>>> >> >> > from the configuration is the default option. Since we will > not always have > >>>>> >> >> > a DFS store when deploying a > >>>>> >> >> > Flink per-job cluster. Of course, we assume that using the > same > >>>>> >> >> > configuration(e.g. job_id, user_jar, main_class, > >>>>> >> >> > main_args, parallelism, savepoint_settings, etc.) will get a > same job > >>>>> >> >> > graph. I think the standalone per-job > >>>>> >> >> > already has the similar behavior. > >>>>> >> >> > > >>>>> >> >> > #3, What happens with jobs that have multiple execute calls? 
> >>>>> >> >> > Currently, it is really a problem. Even we use a local > client on Flink > >>>>> >> >> > master side, it will have different behavior with > >>>>> >> >> > client mode. For client mode, if we execute multiple times, > then we will > >>>>> >> >> > deploy multiple Flink clusters for each execute. > >>>>> >> >> > I am not pretty sure whether it is reasonable. However, i > still think using > >>>>> >> >> > the local client is a good choice. We could > >>>>> >> >> > continue the discussion in a new thread. @Zili Chen < > [hidden email]> Do > >>>>> >> >> > you want to drive this? > >>>>> >> >> > > >>>>> >> >> > > >>>>> >> >> > > >>>>> >> >> > Best, > >>>>> >> >> > Yang > >>>>> >> >> > > >>>>> >> >> > Peter Huang <[hidden email]> 于2020年1月16日周四 > 上午1:55写道: > >>>>> >> >> > > >>>>> >> >> > > Hi Kostas, > >>>>> >> >> > > > >>>>> >> >> > > Thanks for this feedback. I can't agree more about the > opinion. The > >>>>> >> >> > > cluster mode should be added > >>>>> >> >> > > first in per job cluster. > >>>>> >> >> > > > >>>>> >> >> > > 1) For job cluster implementation > >>>>> >> >> > > 1. Job graph recovery from configuration or store as > static job graph as > >>>>> >> >> > > session cluster. I think the static one will be better for > less recovery > >>>>> >> >> > > time. > >>>>> >> >> > > Let me update the doc for details. > >>>>> >> >> > > > >>>>> >> >> > > 2. For job execute multiple times, I think @Zili Chen > >>>>> >> >> > > <[hidden email]> has proposed the local client > solution that can > >>>>> >> >> > > the run program actually in the cluster entry point. We > can put the > >>>>> >> >> > > implementation in the second stage, > >>>>> >> >> > > or even a new FLIP for further discussion. > >>>>> >> >> > > > >>>>> >> >> > > 2) For session cluster implementation > >>>>> >> >> > > We can disable the cluster mode for the session cluster in > the first > >>>>> >> >> > > stage. I agree the jar downloading will be a painful thing. > >>>>> >> >> > > We can consider about PoC and performance evaluation > first. If the end to > >>>>> >> >> > > end experience is good enough, then we can consider > >>>>> >> >> > > proceeding with the solution. > >>>>> >> >> > > > >>>>> >> >> > > Looking forward to more opinions from @Yang Wang < > [hidden email]> @Zili > >>>>> >> >> > > Chen <[hidden email]> @Dian Fu < > [hidden email]>. > >>>>> >> >> > > > >>>>> >> >> > > > >>>>> >> >> > > Best Regards > >>>>> >> >> > > Peter Huang > >>>>> >> >> > > > >>>>> >> >> > > On Wed, Jan 15, 2020 at 7:50 AM Kostas Kloudas < > [hidden email]> wrote: > >>>>> >> >> > > > >>>>> >> >> > >> Hi all, > >>>>> >> >> > >> > >>>>> >> >> > >> I am writing here as the discussion on the Google Doc > seems to be a > >>>>> >> >> > >> bit difficult to follow. > >>>>> >> >> > >> > >>>>> >> >> > >> I think that in order to be able to make progress, it > would be helpful > >>>>> >> >> > >> to focus on per-job mode for now. > >>>>> >> >> > >> The reason is that: > >>>>> >> >> > >> 1) making the (unique) JobSubmitHandler responsible for > creating the > >>>>> >> >> > >> jobgraphs, > >>>>> >> >> > >> which includes downloading dependencies, is not an > optimal solution > >>>>> >> >> > >> 2) even if we put the responsibility on the JobMaster, > currently each > >>>>> >> >> > >> job has its own > >>>>> >> >> > >> JobMaster but they all run on the same process, so we > have again a > >>>>> >> >> > >> single entity. 
> >>>>> >> >> > >> > >>>>> >> >> > >> Of course after this is done, and if we feel comfortable > with the > >>>>> >> >> > >> solution, then we can go to the session mode. > >>>>> >> >> > >> > >>>>> >> >> > >> A second comment has to do with fault-tolerance in the > per-job, > >>>>> >> >> > >> cluster-deploy mode. > >>>>> >> >> > >> In the document, it is suggested that upon recovery, the > JobMaster of > >>>>> >> >> > >> each job re-creates the JobGraph. > >>>>> >> >> > >> I am just wondering if it is better to create and store > the jobGraph > >>>>> >> >> > >> upon submission and only fetch it > >>>>> >> >> > >> upon recovery so that we have a static jobGraph. > >>>>> >> >> > >> > >>>>> >> >> > >> Finally, I have a question which is what happens with > jobs that have > >>>>> >> >> > >> multiple execute calls? > >>>>> >> >> > >> The semantics seem to change compared to the current > behaviour, right? > >>>>> >> >> > >> > >>>>> >> >> > >> Cheers, > >>>>> >> >> > >> Kostas > >>>>> >> >> > >> > >>>>> >> >> > >> On Wed, Jan 8, 2020 at 8:05 PM tison < > [hidden email]> wrote: > >>>>> >> >> > >> > > >>>>> >> >> > >> > not always, Yang Wang is also not yet a committer but > he can join the > >>>>> >> >> > >> > channel. I cannot find the id by clicking “Add new > member in channel” so > >>>>> >> >> > >> > come to you and ask for try out the link. Possibly I > will find other > >>>>> >> >> > >> ways > >>>>> >> >> > >> > but the original purpose is that the slack channel is a > public area we > >>>>> >> >> > >> > discuss about developing... > >>>>> >> >> > >> > Best, > >>>>> >> >> > >> > tison. > >>>>> >> >> > >> > > >>>>> >> >> > >> > > >>>>> >> >> > >> > Peter Huang <[hidden email]> 于2020年1月9日周四 > 上午2:44写道: > >>>>> >> >> > >> > > >>>>> >> >> > >> > > Hi Tison, > >>>>> >> >> > >> > > > >>>>> >> >> > >> > > I am not the committer of Flink yet. I think I can't > join it also. > >>>>> >> >> > >> > > > >>>>> >> >> > >> > > > >>>>> >> >> > >> > > Best Regards > >>>>> >> >> > >> > > Peter Huang > >>>>> >> >> > >> > > > >>>>> >> >> > >> > > On Wed, Jan 8, 2020 at 9:39 AM tison < > [hidden email]> wrote: > >>>>> >> >> > >> > > > >>>>> >> >> > >> > > > Hi Peter, > >>>>> >> >> > >> > > > > >>>>> >> >> > >> > > > Could you try out this link? > >>>>> >> >> > >> > > https://the-asf.slack.com/messages/CNA3ADZPH > >>>>> >> >> > >> > > > > >>>>> >> >> > >> > > > Best, > >>>>> >> >> > >> > > > tison. > >>>>> >> >> > >> > > > > >>>>> >> >> > >> > > > > >>>>> >> >> > >> > > > Peter Huang <[hidden email]> > 于2020年1月9日周四 上午1:22写道: > >>>>> >> >> > >> > > > > >>>>> >> >> > >> > > > > Hi Tison, > >>>>> >> >> > >> > > > > > >>>>> >> >> > >> > > > > I can't join the group with shared link. Would > you please add me > >>>>> >> >> > >> into > >>>>> >> >> > >> > > the > >>>>> >> >> > >> > > > > group? My slack account is huangzhenqiu0825. > >>>>> >> >> > >> > > > > Thank you in advance. > >>>>> >> >> > >> > > > > > >>>>> >> >> > >> > > > > > >>>>> >> >> > >> > > > > Best Regards > >>>>> >> >> > >> > > > > Peter Huang > >>>>> >> >> > >> > > > > > >>>>> >> >> > >> > > > > On Wed, Jan 8, 2020 at 12:02 AM tison < > [hidden email]> > >>>>> >> >> > >> wrote: > >>>>> >> >> > >> > > > > > >>>>> >> >> > >> > > > > > Hi Peter, > >>>>> >> >> > >> > > > > > > >>>>> >> >> > >> > > > > > As described above, this effort should get > attention from people > >>>>> >> >> > >> > > > > developing > >>>>> >> >> > >> > > > > > FLIP-73 a.k.a. Executor abstractions. 
I > recommend you to join > >>>>> >> >> > >> the > >>>>> >> >> > >> > > > public > >>>>> >> >> > >> > > > > > slack channel[1] for Flink Client API > Enhancement and you can > >>>>> >> >> > >> try to > >>>>> >> >> > >> > > > > share > >>>>> >> >> > >> > > > > > you detailed thoughts there. It possibly gets > more concrete > >>>>> >> >> > >> > > attentions. > >>>>> >> >> > >> > > > > > > >>>>> >> >> > >> > > > > > Best, > >>>>> >> >> > >> > > > > > tison. > >>>>> >> >> > >> > > > > > > >>>>> >> >> > >> > > > > > [1] > >>>>> >> >> > >> > > > > > > >>>>> >> >> > >> > > > > > > >>>>> >> >> > >> > > > > > >>>>> >> >> > >> > > > > >>>>> >> >> > >> > > > >>>>> >> >> > >> > https://slack.com/share/IS21SJ75H/Rk8HhUly9FuEHb7oGwBZ33uL/enQtODg2MDYwNjE5MTg3LTA2MjIzNDc1M2ZjZDVlMjdlZjk1M2RkYmJhNjAwMTk2ZDZkODQ4NmY5YmI4OGRhNWJkYTViMTM1NzlmMzc4OWM > >>>>> >> >> > >> > > > > > > >>>>> >> >> > >> > > > > > > >>>>> >> >> > >> > > > > > Peter Huang <[hidden email]> > 于2020年1月7日周二 上午5:09写道: > >>>>> >> >> > >> > > > > > > >>>>> >> >> > >> > > > > > > Dear All, > >>>>> >> >> > >> > > > > > > > >>>>> >> >> > >> > > > > > > Happy new year! According to existing > feedback from the > >>>>> >> >> > >> community, > >>>>> >> >> > >> > > we > >>>>> >> >> > >> > > > > > > revised the doc with the consideration of > session cluster > >>>>> >> >> > >> support, > >>>>> >> >> > >> > > > and > >>>>> >> >> > >> > > > > > > concrete interface changes needed and > execution plan. Please > >>>>> >> >> > >> take > >>>>> >> >> > >> > > one > >>>>> >> >> > >> > > > > > more > >>>>> >> >> > >> > > > > > > round of review at your most convenient time. > >>>>> >> >> > >> > > > > > > > >>>>> >> >> > >> > > > > > > > >>>>> >> >> > >> > > > > > > > >>>>> >> >> > >> > > > > > > >>>>> >> >> > >> > > > > > >>>>> >> >> > >> > > > > >>>>> >> >> > >> > > > >>>>> >> >> > >> > https://docs.google.com/document/d/1aAwVjdZByA-0CHbgv16Me-vjaaDMCfhX7TzVVTuifYM/edit# > >>>>> >> >> > >> > > > > > > > >>>>> >> >> > >> > > > > > > > >>>>> >> >> > >> > > > > > > Best Regards > >>>>> >> >> > >> > > > > > > Peter Huang > >>>>> >> >> > >> > > > > > > > >>>>> >> >> > >> > > > > > > > >>>>> >> >> > >> > > > > > > > >>>>> >> >> > >> > > > > > > > >>>>> >> >> > >> > > > > > > > >>>>> >> >> > >> > > > > > > On Thu, Jan 2, 2020 at 11:29 AM Peter Huang < > >>>>> >> >> > >> > > > > [hidden email]> > >>>>> >> >> > >> > > > > > > wrote: > >>>>> >> >> > >> > > > > > > > >>>>> >> >> > >> > > > > > > > Hi Dian, > >>>>> >> >> > >> > > > > > > > Thanks for giving us valuable feedbacks. > >>>>> >> >> > >> > > > > > > > > >>>>> >> >> > >> > > > > > > > 1) It's better to have a whole design for > this feature > >>>>> >> >> > >> > > > > > > > For the suggestion of enabling the cluster > mode also session > >>>>> >> >> > >> > > > > cluster, I > >>>>> >> >> > >> > > > > > > > think Flink already supported it. > WebSubmissionExtension > >>>>> >> >> > >> already > >>>>> >> >> > >> > > > > allows > >>>>> >> >> > >> > > > > > > > users to start a job with the specified jar > by using web UI. > >>>>> >> >> > >> > > > > > > > But we need to enable the feature from CLI > for both local > >>>>> >> >> > >> jar, > >>>>> >> >> > >> > > > remote > >>>>> >> >> > >> > > > > > > jar. > >>>>> >> >> > >> > > > > > > > I will align with Yang Wang first about the > details and > >>>>> >> >> > >> update > >>>>> >> >> > >> > > the > >>>>> >> >> > >> > > > > > design > >>>>> >> >> > >> > > > > > > > doc. 
> >>>>> >> >> > >> > > > > > > > > >>>>> >> >> > >> > > > > > > > 2) It's better to consider the convenience > for users, such > >>>>> >> >> > >> as > >>>>> >> >> > >> > > > > debugging > >>>>> >> >> > >> > > > > > > > > >>>>> >> >> > >> > > > > > > > I am wondering whether we can store the > exception in > >>>>> >> >> > >> jobgragh > >>>>> >> >> > >> > > > > > > > generation in application master. As no > streaming graph can > >>>>> >> >> > >> be > >>>>> >> >> > >> > > > > > scheduled > >>>>> >> >> > >> > > > > > > in > >>>>> >> >> > >> > > > > > > > this case, there will be no more TM will be > requested from > >>>>> >> >> > >> > > FlinkRM. > >>>>> >> >> > >> > > > > > > > If the AM is still running, users can still > query it from > >>>>> >> >> > >> CLI. As > >>>>> >> >> > >> > > > it > >>>>> >> >> > >> > > > > > > > requires more change, we can get some > feedback from < > >>>>> >> >> > >> > > > > > [hidden email] > >>>>> >> >> > >> > > > > > > > > >>>>> >> >> > >> > > > > > > > and @[hidden email] <[hidden email]>. > >>>>> >> >> > >> > > > > > > > > >>>>> >> >> > >> > > > > > > > 3) It's better to consider the impact to > the stability of > >>>>> >> >> > >> the > >>>>> >> >> > >> > > > cluster > >>>>> >> >> > >> > > > > > > > > >>>>> >> >> > >> > > > > > > > I agree with Yang Wang's opinion. > >>>>> >> >> > >> > > > > > > > > >>>>> >> >> > >> > > > > > > > > >>>>> >> >> > >> > > > > > > > > >>>>> >> >> > >> > > > > > > > Best Regards > >>>>> >> >> > >> > > > > > > > Peter Huang > >>>>> >> >> > >> > > > > > > > > >>>>> >> >> > >> > > > > > > > > >>>>> >> >> > >> > > > > > > > On Sun, Dec 29, 2019 at 9:44 PM Dian Fu < > >>>>> >> >> > >> [hidden email]> > >>>>> >> >> > >> > > > > wrote: > >>>>> >> >> > >> > > > > > > > > >>>>> >> >> > >> > > > > > > >> Hi all, > >>>>> >> >> > >> > > > > > > >> > >>>>> >> >> > >> > > > > > > >> Sorry to jump into this discussion. Thanks > everyone for the > >>>>> >> >> > >> > > > > > discussion. > >>>>> >> >> > >> > > > > > > >> I'm very interested in this topic although > I'm not an > >>>>> >> >> > >> expert in > >>>>> >> >> > >> > > > this > >>>>> >> >> > >> > > > > > > part. > >>>>> >> >> > >> > > > > > > >> So I'm glad to share my thoughts as > following: > >>>>> >> >> > >> > > > > > > >> > >>>>> >> >> > >> > > > > > > >> 1) It's better to have a whole design for > this feature > >>>>> >> >> > >> > > > > > > >> As we know, there are two deployment > modes: per-job mode > >>>>> >> >> > >> and > >>>>> >> >> > >> > > > session > >>>>> >> >> > >> > > > > > > >> mode. I'm wondering which mode really > needs this feature. > >>>>> >> >> > >> As the > >>>>> >> >> > >> > > > > > design > >>>>> >> >> > >> > > > > > > doc > >>>>> >> >> > >> > > > > > > >> mentioned, per-job mode is more used for > streaming jobs and > >>>>> >> >> > >> > > > session > >>>>> >> >> > >> > > > > > > mode is > >>>>> >> >> > >> > > > > > > >> usually used for batch jobs(Of course, the > job types and > >>>>> >> >> > >> the > >>>>> >> >> > >> > > > > > deployment > >>>>> >> >> > >> > > > > > > >> modes are orthogonal). Usually streaming > job is only > >>>>> >> >> > >> needed to > >>>>> >> >> > >> > > be > >>>>> >> >> > >> > > > > > > submitted > >>>>> >> >> > >> > > > > > > >> once and it will run for days or weeks, > while batch jobs > >>>>> >> >> > >> will be > >>>>> >> >> > >> > > > > > > submitted > >>>>> >> >> > >> > > > > > > >> more frequently compared with streaming > jobs. 
This means > >>>>> >> >> > >> that > >>>>> >> >> > >> > > > maybe > >>>>> >> >> > >> > > > > > > session > >>>>> >> >> > >> > > > > > > >> mode also needs this feature. However, if > we support this > >>>>> >> >> > >> > > feature > >>>>> >> >> > >> > > > in > >>>>> >> >> > >> > > > > > > >> session mode, the application master will > become the new > >>>>> >> >> > >> > > > centralized > >>>>> >> >> > >> > > > > > > >> service(which should be solved). So in > this case, it's > >>>>> >> >> > >> better to > >>>>> >> >> > >> > > > > have > >>>>> >> >> > >> > > > > > a > >>>>> >> >> > >> > > > > > > >> complete design for both per-job mode and > session mode. > >>>>> >> >> > >> > > > Furthermore, > >>>>> >> >> > >> > > > > > > even > >>>>> >> >> > >> > > > > > > >> if we can do it phase by phase, we need to > have a whole > >>>>> >> >> > >> picture > >>>>> >> >> > >> > > of > >>>>> >> >> > >> > > > > how > >>>>> >> >> > >> > > > > > > it > >>>>> >> >> > >> > > > > > > >> works in both per-job mode and session > mode. > >>>>> >> >> > >> > > > > > > >> > >>>>> >> >> > >> > > > > > > >> 2) It's better to consider the convenience > for users, such > >>>>> >> >> > >> as > >>>>> >> >> > >> > > > > > debugging > >>>>> >> >> > >> > > > > > > >> After we finish this feature, the job > graph will be > >>>>> >> >> > >> compiled in > >>>>> >> >> > >> > > > the > >>>>> >> >> > >> > > > > > > >> application master, which means that users > cannot easily > >>>>> >> >> > >> get the > >>>>> >> >> > >> > > > > > > exception > >>>>> >> >> > >> > > > > > > >> message synchorousely in the job client if > there are > >>>>> >> >> > >> problems > >>>>> >> >> > >> > > > during > >>>>> >> >> > >> > > > > > the > >>>>> >> >> > >> > > > > > > >> job graph compiling (especially for > platform users), such > >>>>> >> >> > >> as the > >>>>> >> >> > >> > > > > > > resource > >>>>> >> >> > >> > > > > > > >> path is incorrect, the user program itself > has some > >>>>> >> >> > >> problems, > >>>>> >> >> > >> > > etc. > >>>>> >> >> > >> > > > > > What > >>>>> >> >> > >> > > > > > > I'm > >>>>> >> >> > >> > > > > > > >> thinking is that maybe we should throw the > exceptions as > >>>>> >> >> > >> early > >>>>> >> >> > >> > > as > >>>>> >> >> > >> > > > > > > possible > >>>>> >> >> > >> > > > > > > >> (during job submission stage). > >>>>> >> >> > >> > > > > > > >> > >>>>> >> >> > >> > > > > > > >> 3) It's better to consider the impact to > the stability of > >>>>> >> >> > >> the > >>>>> >> >> > >> > > > > cluster > >>>>> >> >> > >> > > > > > > >> If we perform the compiling in the > application master, we > >>>>> >> >> > >> should > >>>>> >> >> > >> > > > > > > consider > >>>>> >> >> > >> > > > > > > >> the impact of the compiling errors. 
> Although YARN could > >>>>> >> >> > >> resume > >>>>> >> >> > >> > > the > >>>>> >> >> > >> > > > > > > >> application master in case of failures, > but in some case > >>>>> >> >> > >> the > >>>>> >> >> > >> > > > > compiling > >>>>> >> >> > >> > > > > > > >> failure may be a waste of cluster resource > and may impact > >>>>> >> >> > >> the > >>>>> >> >> > >> > > > > > stability > >>>>> >> >> > >> > > > > > > the > >>>>> >> >> > >> > > > > > > >> cluster and the other jobs in the cluster, > such as the > >>>>> >> >> > >> resource > >>>>> >> >> > >> > > > path > >>>>> >> >> > >> > > > > > is > >>>>> >> >> > >> > > > > > > >> incorrect, the user program itself has > some problems(in > >>>>> >> >> > >> this > >>>>> >> >> > >> > > case, > >>>>> >> >> > >> > > > > job > >>>>> >> >> > >> > > > > > > >> failover cannot solve this kind of > problems) etc. In the > >>>>> >> >> > >> current > >>>>> >> >> > >> > > > > > > >> implemention, the compiling errors are > handled in the > >>>>> >> >> > >> client > >>>>> >> >> > >> > > side > >>>>> >> >> > >> > > > > and > >>>>> >> >> > >> > > > > > > there > >>>>> >> >> > >> > > > > > > >> is no impact to the cluster at all. > >>>>> >> >> > >> > > > > > > >> > >>>>> >> >> > >> > > > > > > >> Regarding to 1), it's clearly pointed in > the design doc > >>>>> >> >> > >> that > >>>>> >> >> > >> > > only > >>>>> >> >> > >> > > > > > > per-job > >>>>> >> >> > >> > > > > > > >> mode will be supported. However, I think > it's better to > >>>>> >> >> > >> also > >>>>> >> >> > >> > > > > consider > >>>>> >> >> > >> > > > > > > the > >>>>> >> >> > >> > > > > > > >> session mode in the design doc. > >>>>> >> >> > >> > > > > > > >> Regarding to 2) and 3), I have not seen > related sections > >>>>> >> >> > >> in the > >>>>> >> >> > >> > > > > design > >>>>> >> >> > >> > > > > > > >> doc. It will be good if we can cover them > in the design > >>>>> >> >> > >> doc. > >>>>> >> >> > >> > > > > > > >> > >>>>> >> >> > >> > > > > > > >> Feel free to correct me If there is > anything I > >>>>> >> >> > >> misunderstand. > >>>>> >> >> > >> > > > > > > >> > >>>>> >> >> > >> > > > > > > >> Regards, > >>>>> >> >> > >> > > > > > > >> Dian > >>>>> >> >> > >> > > > > > > >> > >>>>> >> >> > >> > > > > > > >> > >>>>> >> >> > >> > > > > > > >> > 在 2019年12月27日,上午3:13,Peter Huang < > >>>>> >> >> > >> [hidden email]> > >>>>> >> >> > >> > > > 写道: > >>>>> >> >> > >> > > > > > > >> > > >>>>> >> >> > >> > > > > > > >> > Hi Yang, > >>>>> >> >> > >> > > > > > > >> > > >>>>> >> >> > >> > > > > > > >> > I can't agree more. The effort > definitely needs to align > >>>>> >> >> > >> with > >>>>> >> >> > >> > > > the > >>>>> >> >> > >> > > > > > > final > >>>>> >> >> > >> > > > > > > >> > goal of FLIP-73. > >>>>> >> >> > >> > > > > > > >> > I am thinking about whether we can > achieve the goal with > >>>>> >> >> > >> two > >>>>> >> >> > >> > > > > phases. > >>>>> >> >> > >> > > > > > > >> > > >>>>> >> >> > >> > > > > > > >> > 1) Phase I > >>>>> >> >> > >> > > > > > > >> > As the CLiFrontend will not be > depreciated soon. 
We can > >>>>> >> >> > >> still > >>>>> >> >> > >> > > > use > >>>>> >> >> > >> > > > > > the > >>>>> >> >> > >> > > > > > > >> > deployMode flag there, > >>>>> >> >> > >> > > > > > > >> > pass the program info through Flink > configuration, use > >>>>> >> >> > >> the > >>>>> >> >> > >> > > > > > > >> > ClassPathJobGraphRetriever > >>>>> >> >> > >> > > > > > > >> > to generate the job graph in > ClusterEntrypoints of yarn > >>>>> >> >> > >> and > >>>>> >> >> > >> > > > > > > Kubernetes. > >>>>> >> >> > >> > > > > > > >> > > >>>>> >> >> > >> > > > > > > >> > 2) Phase II > >>>>> >> >> > >> > > > > > > >> > In AbstractJobClusterExecutor, the job > graph is > >>>>> >> >> > >> generated in > >>>>> >> >> > >> > > > the > >>>>> >> >> > >> > > > > > > >> execute > >>>>> >> >> > >> > > > > > > >> > function. We can still > >>>>> >> >> > >> > > > > > > >> > use the deployMode in it. With > deployMode = cluster, the > >>>>> >> >> > >> > > execute > >>>>> >> >> > >> > > > > > > >> function > >>>>> >> >> > >> > > > > > > >> > only starts the cluster. > >>>>> >> >> > >> > > > > > > >> > > >>>>> >> >> > >> > > > > > > >> > When > {Yarn/Kuberneates}PerJobClusterEntrypoint starts, > >>>>> >> >> > >> It will > >>>>> >> >> > >> > > > > start > >>>>> >> >> > >> > > > > > > the > >>>>> >> >> > >> > > > > > > >> > dispatch first, then we can use > >>>>> >> >> > >> > > > > > > >> > a ClusterEnvironment similar to > ContextEnvironment to > >>>>> >> >> > >> submit > >>>>> >> >> > >> > > the > >>>>> >> >> > >> > > > > job > >>>>> >> >> > >> > > > > > > >> with > >>>>> >> >> > >> > > > > > > >> > jobName the local > >>>>> >> >> > >> > > > > > > >> > dispatcher. For the details, we need > more investigation. > >>>>> >> >> > >> Let's > >>>>> >> >> > >> > > > > wait > >>>>> >> >> > >> > > > > > > >> > for @Aljoscha > >>>>> >> >> > >> > > > > > > >> > Krettek <[hidden email]> @Till > Rohrmann < > >>>>> >> >> > >> > > > > [hidden email] > >>>>> >> >> > >> > > > > > >'s > >>>>> >> >> > >> > > > > > > >> > feedback after the holiday season. > >>>>> >> >> > >> > > > > > > >> > > >>>>> >> >> > >> > > > > > > >> > Thank you in advance. Merry Chrismas and > Happy New > >>>>> >> >> > >> Year!!! > >>>>> >> >> > >> > > > > > > >> > > >>>>> >> >> > >> > > > > > > >> > > >>>>> >> >> > >> > > > > > > >> > Best Regards > >>>>> >> >> > >> > > > > > > >> > Peter Huang > >>>>> >> >> > >> > > > > > > >> > > >>>>> >> >> > >> > > > > > > >> > > >>>>> >> >> > >> > > > > > > >> > > >>>>> >> >> > >> > > > > > > >> > > >>>>> >> >> > >> > > > > > > >> > > >>>>> >> >> > >> > > > > > > >> > > >>>>> >> >> > >> > > > > > > >> > > >>>>> >> >> > >> > > > > > > >> > > >>>>> >> >> > >> > > > > > > >> > On Wed, Dec 25, 2019 at 1:08 AM Yang > Wang < > >>>>> >> >> > >> > > > [hidden email]> > >>>>> >> >> > >> > > > > > > >> wrote: > >>>>> >> >> > >> > > > > > > >> > > >>>>> >> >> > >> > > > > > > >> >> Hi Peter, > >>>>> >> >> > >> > > > > > > >> >> > >>>>> >> >> > >> > > > > > > >> >> I think we need to reconsider tison's > suggestion > >>>>> >> >> > >> seriously. > >>>>> >> >> > >> > > > After > >>>>> >> >> > >> > > > > > > >> FLIP-73, > >>>>> >> >> > >> > > > > > > >> >> the deployJobCluster has > >>>>> >> >> > >> > > > > > > >> >> beenmoved into > `JobClusterExecutor#execute`. It should > >>>>> >> >> > >> not be > >>>>> >> >> > >> > > > > > > perceived > >>>>> >> >> > >> > > > > > > >> >> for `CliFrontend`. 
That > >>>>> >> >> > >> > > > > > > >> >> means the user program will *ALWAYS* be > executed on > >>>>> >> >> > >> client > >>>>> >> >> > >> > > > side. > >>>>> >> >> > >> > > > > > This > >>>>> >> >> > >> > > > > > > >> is > >>>>> >> >> > >> > > > > > > >> >> the by design behavior. > >>>>> >> >> > >> > > > > > > >> >> So, we could not just add `if(client > mode) .. else > >>>>> >> >> > >> if(cluster > >>>>> >> >> > >> > > > > mode) > >>>>> >> >> > >> > > > > > > >> ...` > >>>>> >> >> > >> > > > > > > >> >> codes in `CliFrontend` to bypass > >>>>> >> >> > >> > > > > > > >> >> the executor. We need to find a clean > way to decouple > >>>>> >> >> > >> > > executing > >>>>> >> >> > >> > > > > > user > >>>>> >> >> > >> > > > > > > >> >> program and deploying per-job > >>>>> >> >> > >> > > > > > > >> >> cluster. Based on this, we could > support to execute user > >>>>> >> >> > >> > > > program > >>>>> >> >> > >> > > > > on > >>>>> >> >> > >> > > > > > > >> client > >>>>> >> >> > >> > > > > > > >> >> or master side. > >>>>> >> >> > >> > > > > > > >> >> > >>>>> >> >> > >> > > > > > > >> >> Maybe Aljoscha and Jeff could give some > good > >>>>> >> >> > >> suggestions. > >>>>> >> >> > >> > > > > > > >> >> > >>>>> >> >> > >> > > > > > > >> >> > >>>>> >> >> > >> > > > > > > >> >> > >>>>> >> >> > >> > > > > > > >> >> Best, > >>>>> >> >> > >> > > > > > > >> >> Yang > >>>>> >> >> > >> > > > > > > >> >> > >>>>> >> >> > >> > > > > > > >> >> Peter Huang <[hidden email]> > 于2019年12月25日周三 > >>>>> >> >> > >> > > > > 上午4:03写道: > >>>>> >> >> > >> > > > > > > >> >> > >>>>> >> >> > >> > > > > > > >> >>> Hi Jingjing, > >>>>> >> >> > >> > > > > > > >> >>> > >>>>> >> >> > >> > > > > > > >> >>> The improvement proposed is a > deployment option for > >>>>> >> >> > >> CLI. For > >>>>> >> >> > >> > > > SQL > >>>>> >> >> > >> > > > > > > based > >>>>> >> >> > >> > > > > > > >> >>> Flink application, It is more > convenient to use the > >>>>> >> >> > >> existing > >>>>> >> >> > >> > > > > model > >>>>> >> >> > >> > > > > > > in > >>>>> >> >> > >> > > > > > > >> >>> SqlClient in which > >>>>> >> >> > >> > > > > > > >> >>> the job graph is generated within > SqlClient. After > >>>>> >> >> > >> adding > >>>>> >> >> > >> > > the > >>>>> >> >> > >> > > > > > > delayed > >>>>> >> >> > >> > > > > > > >> job > >>>>> >> >> > >> > > > > > > >> >>> graph generation, I think there is no > change is needed > >>>>> >> >> > >> for > >>>>> >> >> > >> > > > your > >>>>> >> >> > >> > > > > > > side. > >>>>> >> >> > >> > > > > > > >> >>> > >>>>> >> >> > >> > > > > > > >> >>> > >>>>> >> >> > >> > > > > > > >> >>> Best Regards > >>>>> >> >> > >> > > > > > > >> >>> Peter Huang > >>>>> >> >> > >> > > > > > > >> >>> > >>>>> >> >> > >> > > > > > > >> >>> > >>>>> >> >> > >> > > > > > > >> >>> On Wed, Dec 18, 2019 at 6:01 AM > jingjing bai < > >>>>> >> >> > >> > > > > > > >> [hidden email]> > >>>>> >> >> > >> > > > > > > >> >>> wrote: > >>>>> >> >> > >> > > > > > > >> >>> > >>>>> >> >> > >> > > > > > > >> >>>> hi peter: > >>>>> >> >> > >> > > > > > > >> >>>> we had extension SqlClent to > support sql job > >>>>> >> >> > >> submit in > >>>>> >> >> > >> > > web > >>>>> >> >> > >> > > > > > base > >>>>> >> >> > >> > > > > > > on > >>>>> >> >> > >> > > > > > > >> >>>> flink 1.9. we support submit to > yarn on per job > >>>>> >> >> > >> mode too. > >>>>> >> >> > >> > > > > > > >> >>>> in this case, the job graph > generated on client > >>>>> >> >> > >> side > >>>>> >> >> > >> > > . 
I > >>>>> >> >> > >> > > > > > think > >>>>> >> >> > >> > > > > > > >> >>> this > >>>>> >> >> > >> > > > > > > >> >>>> discuss Mainly to improve api > programme. but in my > >>>>> >> >> > >> case , > >>>>> >> >> > >> > > > > there > >>>>> >> >> > >> > > > > > is > >>>>> >> >> > >> > > > > > > >> no > >>>>> >> >> > >> > > > > > > >> >>>> jar to upload but only a sql string . > >>>>> >> >> > >> > > > > > > >> >>>> do u had more suggestion to > improve for sql mode > >>>>> >> >> > >> or it > >>>>> >> >> > >> > > is > >>>>> >> >> > >> > > > > > only a > >>>>> >> >> > >> > > > > > > >> >>>> switch for api programme? > >>>>> >> >> > >> > > > > > > >> >>>> > >>>>> >> >> > >> > > > > > > >> >>>> > >>>>> >> >> > >> > > > > > > >> >>>> best > >>>>> >> >> > >> > > > > > > >> >>>> bai jj > >>>>> >> >> > >> > > > > > > >> >>>> > >>>>> >> >> > >> > > > > > > >> >>>> > >>>>> >> >> > >> > > > > > > >> >>>> Yang Wang <[hidden email]> > 于2019年12月18日周三 > >>>>> >> >> > >> 下午7:21写道: > >>>>> >> >> > >> > > > > > > >> >>>> > >>>>> >> >> > >> > > > > > > >> >>>>> I just want to revive this > discussion. > >>>>> >> >> > >> > > > > > > >> >>>>> > >>>>> >> >> > >> > > > > > > >> >>>>> Recently, i am thinking about how to > natively run > >>>>> >> >> > >> flink > >>>>> >> >> > >> > > > > per-job > >>>>> >> >> > >> > > > > > > >> >>> cluster on > >>>>> >> >> > >> > > > > > > >> >>>>> Kubernetes. > >>>>> >> >> > >> > > > > > > >> >>>>> The per-job mode on Kubernetes is > very different > >>>>> >> >> > >> from on > >>>>> >> >> > >> > > > Yarn. > >>>>> >> >> > >> > > > > > And > >>>>> >> >> > >> > > > > > > >> we > >>>>> >> >> > >> > > > > > > >> >>> will > >>>>> >> >> > >> > > > > > > >> >>>>> have > >>>>> >> >> > >> > > > > > > >> >>>>> the same deployment requirements to > the client and > >>>>> >> >> > >> entry > >>>>> >> >> > >> > > > > point. > >>>>> >> >> > >> > > > > > > >> >>>>> > >>>>> >> >> > >> > > > > > > >> >>>>> 1. Flink client not always need a > local jar to start > >>>>> >> >> > >> a > >>>>> >> >> > >> > > Flink > >>>>> >> >> > >> > > > > > > per-job > >>>>> >> >> > >> > > > > > > >> >>>>> cluster. We could > >>>>> >> >> > >> > > > > > > >> >>>>> support multiple schemas. For > example, > >>>>> >> >> > >> > > > file:///path/of/my.jar > >>>>> >> >> > >> > > > > > > means > >>>>> >> >> > >> > > > > > > >> a > >>>>> >> >> > >> > > > > > > >> >>> jar > >>>>> >> >> > >> > > > > > > >> >>>>> located > >>>>> >> >> > >> > > > > > > >> >>>>> at client side, > >>>>> >> >> > >> hdfs://myhdfs/user/myname/flink/my.jar > >>>>> >> >> > >> > > > means a > >>>>> >> >> > >> > > > > > jar > >>>>> >> >> > >> > > > > > > >> >>> located > >>>>> >> >> > >> > > > > > > >> >>>>> at > >>>>> >> >> > >> > > > > > > >> >>>>> remote hdfs, > local:///path/in/image/my.jar means a > >>>>> >> >> > >> jar > >>>>> >> >> > >> > > > located > >>>>> >> >> > >> > > > > > at > >>>>> >> >> > >> > > > > > > >> >>>>> jobmanager side. > >>>>> >> >> > >> > > > > > > >> >>>>> > >>>>> >> >> > >> > > > > > > >> >>>>> 2. Support running user program on > master side. This > >>>>> >> >> > >> also > >>>>> >> >> > >> > > > > means > >>>>> >> >> > >> > > > > > > the > >>>>> >> >> > >> > > > > > > >> >>> entry > >>>>> >> >> > >> > > > > > > >> >>>>> point > >>>>> >> >> > >> > > > > > > >> >>>>> will generate the job graph on > master side. 
We could > >>>>> >> >> > >> use > >>>>> >> >> > >> > > the > >>>>> >> >> > >> > > > > > > >> >>>>> ClasspathJobGraphRetriever > >>>>> >> >> > >> > > > > > > >> >>>>> or start a local Flink client to > achieve this > >>>>> >> >> > >> purpose. > >>>>> >> >> > >> > > > > > > >> >>>>> > >>>>> >> >> > >> > > > > > > >> >>>>> > >>>>> >> >> > >> > > > > > > >> >>>>> cc tison, Aljoscha & Kostas Do you > think this is the > >>>>> >> >> > >> right > >>>>> >> >> > >> > > > > > > >> direction we > >>>>> >> >> > >> > > > > > > >> >>>>> need to work? > >>>>> >> >> > >> > > > > > > >> >>>>> > >>>>> >> >> > >> > > > > > > >> >>>>> tison <[hidden email]> > 于2019年12月12日周四 > >>>>> >> >> > >> 下午4:48写道: > >>>>> >> >> > >> > > > > > > >> >>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>> A quick idea is that we separate > the deployment > >>>>> >> >> > >> from user > >>>>> >> >> > >> > > > > > program > >>>>> >> >> > >> > > > > > > >> >>> that > >>>>> >> >> > >> > > > > > > >> >>>>> it > >>>>> >> >> > >> > > > > > > >> >>>>>> has always been done > >>>>> >> >> > >> > > > > > > >> >>>>>> outside the program. On user > program executed there > >>>>> >> >> > >> is > >>>>> >> >> > >> > > > > always a > >>>>> >> >> > >> > > > > > > >> >>>>>> ClusterClient that communicates with > >>>>> >> >> > >> > > > > > > >> >>>>>> an existing cluster, remote or > local. It will be > >>>>> >> >> > >> another > >>>>> >> >> > >> > > > > thread > >>>>> >> >> > >> > > > > > > so > >>>>> >> >> > >> > > > > > > >> >>> just > >>>>> >> >> > >> > > > > > > >> >>>>> for > >>>>> >> >> > >> > > > > > > >> >>>>>> your information. > >>>>> >> >> > >> > > > > > > >> >>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>> Best, > >>>>> >> >> > >> > > > > > > >> >>>>>> tison. > >>>>> >> >> > >> > > > > > > >> >>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>> tison <[hidden email]> > 于2019年12月12日周四 > >>>>> >> >> > >> 下午4:40写道: > >>>>> >> >> > >> > > > > > > >> >>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>>> Hi Peter, > >>>>> >> >> > >> > > > > > > >> >>>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>>> Another concern I realized > recently is that with > >>>>> >> >> > >> current > >>>>> >> >> > >> > > > > > > Executors > >>>>> >> >> > >> > > > > > > >> >>>>>>> abstraction(FLIP-73) > >>>>> >> >> > >> > > > > > > >> >>>>>>> I'm afraid that user program is > designed to ALWAYS > >>>>> >> >> > >> run > >>>>> >> >> > >> > > on > >>>>> >> >> > >> > > > > the > >>>>> >> >> > >> > > > > > > >> >>> client > >>>>> >> >> > >> > > > > > > >> >>>>>> side. > >>>>> >> >> > >> > > > > > > >> >>>>>>> Specifically, > >>>>> >> >> > >> > > > > > > >> >>>>>>> we deploy the job in executor when > env.execute > >>>>> >> >> > >> called. > >>>>> >> >> > >> > > > This > >>>>> >> >> > >> > > > > > > >> >>>>> abstraction > >>>>> >> >> > >> > > > > > > >> >>>>>>> possibly prevents > >>>>> >> >> > >> > > > > > > >> >>>>>>> Flink runs user program on the > cluster side. 
> >>>>> >> >> > >> > > > > > > >> >>>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>>> For your proposal, in this case we > already > >>>>> >> >> > >> compiled the > >>>>> >> >> > >> > > > > > program > >>>>> >> >> > >> > > > > > > >> and > >>>>> >> >> > >> > > > > > > >> >>>>> run > >>>>> >> >> > >> > > > > > > >> >>>>>> on > >>>>> >> >> > >> > > > > > > >> >>>>>>> the client side, > >>>>> >> >> > >> > > > > > > >> >>>>>>> even we deploy a cluster and > retrieve job graph > >>>>> >> >> > >> from > >>>>> >> >> > >> > > > program > >>>>> >> >> > >> > > > > > > >> >>>>> metadata, it > >>>>> >> >> > >> > > > > > > >> >>>>>>> doesn't make > >>>>> >> >> > >> > > > > > > >> >>>>>>> many sense. > >>>>> >> >> > >> > > > > > > >> >>>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>>> cc Aljoscha & Kostas what do you > think about this > >>>>> >> >> > >> > > > > constraint? > >>>>> >> >> > >> > > > > > > >> >>>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>>> Best, > >>>>> >> >> > >> > > > > > > >> >>>>>>> tison. > >>>>> >> >> > >> > > > > > > >> >>>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>>> Peter Huang < > [hidden email]> > >>>>> >> >> > >> 于2019年12月10日周二 > >>>>> >> >> > >> > > > > > > >> 下午12:45写道: > >>>>> >> >> > >> > > > > > > >> >>>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>>>> Hi Tison, > >>>>> >> >> > >> > > > > > > >> >>>>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>>>> Yes, you are right. I think I > made the wrong > >>>>> >> >> > >> argument > >>>>> >> >> > >> > > in > >>>>> >> >> > >> > > > > the > >>>>> >> >> > >> > > > > > > doc. > >>>>> >> >> > >> > > > > > > >> >>>>>>>> Basically, the packaging jar > problem is only for > >>>>> >> >> > >> > > platform > >>>>> >> >> > >> > > > > > > users. > >>>>> >> >> > >> > > > > > > >> >>> In > >>>>> >> >> > >> > > > > > > >> >>>>> our > >>>>> >> >> > >> > > > > > > >> >>>>>>>> internal deploy service, > >>>>> >> >> > >> > > > > > > >> >>>>>>>> we further optimized the > deployment latency by > >>>>> >> >> > >> letting > >>>>> >> >> > >> > > > > users > >>>>> >> >> > >> > > > > > to > >>>>> >> >> > >> > > > > > > >> >>>>>> packaging > >>>>> >> >> > >> > > > > > > >> >>>>>>>> flink-runtime together with the > uber jar, so that > >>>>> >> >> > >> we > >>>>> >> >> > >> > > > don't > >>>>> >> >> > >> > > > > > need > >>>>> >> >> > >> > > > > > > >> to > >>>>> >> >> > >> > > > > > > >> >>>>>>>> consider > >>>>> >> >> > >> > > > > > > >> >>>>>>>> multiple flink version > >>>>> >> >> > >> > > > > > > >> >>>>>>>> support for now. In the session > client mode, as > >>>>> >> >> > >> Flink > >>>>> >> >> > >> > > > libs > >>>>> >> >> > >> > > > > > will > >>>>> >> >> > >> > > > > > > >> be > >>>>> >> >> > >> > > > > > > >> >>>>>> shipped > >>>>> >> >> > >> > > > > > > >> >>>>>>>> anyway as local resources of > yarn. Users actually > >>>>> >> >> > >> don't > >>>>> >> >> > >> > > > > need > >>>>> >> >> > >> > > > > > to > >>>>> >> >> > >> > > > > > > >> >>>>> package > >>>>> >> >> > >> > > > > > > >> >>>>>>>> those libs into job jar. 
> >>>>> >> >> > >> > > > > > > >> >>>>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>>>> Best Regards > >>>>> >> >> > >> > > > > > > >> >>>>>>>> Peter Huang > >>>>> >> >> > >> > > > > > > >> >>>>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>>>> On Mon, Dec 9, 2019 at 8:35 PM > tison < > >>>>> >> >> > >> > > > [hidden email] > >>>>> >> >> > >> > > > > > > >>>>> >> >> > >> > > > > > > >> >>> wrote: > >>>>> >> >> > >> > > > > > > >> >>>>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> 3. What do you mean about the > package? Do users > >>>>> >> >> > >> need > >>>>> >> >> > >> > > to > >>>>> >> >> > >> > > > > > > >> >>> compile > >>>>> >> >> > >> > > > > > > >> >>>>>> their > >>>>> >> >> > >> > > > > > > >> >>>>>>>>> jars > >>>>> >> >> > >> > > > > > > >> >>>>>>>>> inlcuding flink-clients, > flink-optimizer, > >>>>> >> >> > >> flink-table > >>>>> >> >> > >> > > > > codes? > >>>>> >> >> > >> > > > > > > >> >>>>>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>>>>> The answer should be no because > they exist in > >>>>> >> >> > >> system > >>>>> >> >> > >> > > > > > > classpath. > >>>>> >> >> > >> > > > > > > >> >>>>>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>>>>> Best, > >>>>> >> >> > >> > > > > > > >> >>>>>>>>> tison. > >>>>> >> >> > >> > > > > > > >> >>>>>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>>>>> Yang Wang <[hidden email]> > 于2019年12月10日周二 > >>>>> >> >> > >> > > > > 下午12:18写道: > >>>>> >> >> > >> > > > > > > >> >>>>>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> Hi Peter, > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> Thanks a lot for starting this > discussion. I > >>>>> >> >> > >> think > >>>>> >> >> > >> > > this > >>>>> >> >> > >> > > > > is > >>>>> >> >> > >> > > > > > a > >>>>> >> >> > >> > > > > > > >> >>> very > >>>>> >> >> > >> > > > > > > >> >>>>>>>> useful > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> feature. > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> Not only for Yarn, i am focused > on flink on > >>>>> >> >> > >> > > Kubernetes > >>>>> >> >> > >> > > > > > > >> >>>>> integration > >>>>> >> >> > >> > > > > > > >> >>>>>> and > >>>>> >> >> > >> > > > > > > >> >>>>>>>>> come > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> across the same > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> problem. I do not want the job > graph generated > >>>>> >> >> > >> on > >>>>> >> >> > >> > > > client > >>>>> >> >> > >> > > > > > > side. > >>>>> >> >> > >> > > > > > > >> >>>>>>>> Instead, > >>>>> >> >> > >> > > > > > > >> >>>>>>>>> the > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> user jars are built in > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> a user-defined image. When the > job manager > >>>>> >> >> > >> launched, > >>>>> >> >> > >> > > we > >>>>> >> >> > >> > > > > > just > >>>>> >> >> > >> > > > > > > >> >>>>> need to > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> generate the job graph > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> based on local user jars. > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> I have some small suggestion > about this. > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> 1. 
`ProgramJobGraphRetriever` > is very similar to > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> `ClasspathJobGraphRetriever`, > the differences > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> are the former needs > `ProgramMetadata` and the > >>>>> >> >> > >> latter > >>>>> >> >> > >> > > > > needs > >>>>> >> >> > >> > > > > > > >> >>> some > >>>>> >> >> > >> > > > > > > >> >>>>>>>>> arguments. > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> Is it possible to > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> have an unified > `JobGraphRetriever` to support > >>>>> >> >> > >> both? > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> 2. Is it possible to not use a > local user jar to > >>>>> >> >> > >> > > start > >>>>> >> >> > >> > > > a > >>>>> >> >> > >> > > > > > > >> >>> per-job > >>>>> >> >> > >> > > > > > > >> >>>>>>>> cluster? > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> In your case, the user jars has > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> existed on hdfs already and we > do need to > >>>>> >> >> > >> download > >>>>> >> >> > >> > > the > >>>>> >> >> > >> > > > > jars > >>>>> >> >> > >> > > > > > > to > >>>>> >> >> > >> > > > > > > >> >>>>>>>> deployer > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> service. Currently, we > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> always need a local user jar to > start a flink > >>>>> >> >> > >> > > cluster. > >>>>> >> >> > >> > > > It > >>>>> >> >> > >> > > > > > is > >>>>> >> >> > >> > > > > > > >> >>> be > >>>>> >> >> > >> > > > > > > >> >>>>>> great > >>>>> >> >> > >> > > > > > > >> >>>>>>>> if > >>>>> >> >> > >> > > > > > > >> >>>>>>>>> we > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> could support remote user jars. > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>>> In the implementation, we > assume users package > >>>>> >> >> > >> > > > > > > >> >>> flink-clients, > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> flink-optimizer, flink-table > together within > >>>>> >> >> > >> the job > >>>>> >> >> > >> > > > jar. > >>>>> >> >> > >> > > > > > > >> >>>>> Otherwise, > >>>>> >> >> > >> > > > > > > >> >>>>>>>> the > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> job graph generation within > >>>>> >> >> > >> JobClusterEntryPoint will > >>>>> >> >> > >> > > > > fail. > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> 3. What do you mean about the > package? Do users > >>>>> >> >> > >> need > >>>>> >> >> > >> > > to > >>>>> >> >> > >> > > > > > > >> >>> compile > >>>>> >> >> > >> > > > > > > >> >>>>>> their > >>>>> >> >> > >> > > > > > > >> >>>>>>>>> jars > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> inlcuding flink-clients, > flink-optimizer, > >>>>> >> >> > >> flink-table > >>>>> >> >> > >> > > > > > codes? 
> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> Best, > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> Yang > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> Peter Huang < > [hidden email]> > >>>>> >> >> > >> > > > 于2019年12月10日周二 > >>>>> >> >> > >> > > > > > > >> >>>>> 上午2:37写道: > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> Dear All, > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> Recently, the Flink community > starts to > >>>>> >> >> > >> improve the > >>>>> >> >> > >> > > > yarn > >>>>> >> >> > >> > > > > > > >> >>>>> cluster > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> descriptor > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> to make job jar and config > files configurable > >>>>> >> >> > >> from > >>>>> >> >> > >> > > > CLI. > >>>>> >> >> > >> > > > > It > >>>>> >> >> > >> > > > > > > >> >>>>>> improves > >>>>> >> >> > >> > > > > > > >> >>>>>>>> the > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> flexibility of Flink > deployment Yarn Per Job > >>>>> >> >> > >> Mode. > >>>>> >> >> > >> > > > For > >>>>> >> >> > >> > > > > > > >> >>>>> platform > >>>>> >> >> > >> > > > > > > >> >>>>>>>> users > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> who > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> manage tens of hundreds of > streaming pipelines > >>>>> >> >> > >> for > >>>>> >> >> > >> > > the > >>>>> >> >> > >> > > > > > whole > >>>>> >> >> > >> > > > > > > >> >>>>> org > >>>>> >> >> > >> > > > > > > >> >>>>>> or > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> company, we found the job > graph generation in > >>>>> >> >> > >> > > > > client-side > >>>>> >> >> > >> > > > > > is > >>>>> >> >> > >> > > > > > > >> >>>>>> another > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> pinpoint. Thus, we want to > propose a > >>>>> >> >> > >> configurable > >>>>> >> >> > >> > > > > feature > >>>>> >> >> > >> > > > > > > >> >>> for > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> FlinkYarnSessionCli. The > feature can allow > >>>>> >> >> > >> users to > >>>>> >> >> > >> > > > > choose > >>>>> >> >> > >> > > > > > > >> >>> the > >>>>> >> >> > >> > > > > > > >> >>>>> job > >>>>> >> >> > >> > > > > > > >> >>>>>>>>> graph > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> generation in Flink > ClusterEntryPoint so that > >>>>> >> >> > >> the > >>>>> >> >> > >> > > job > >>>>> >> >> > >> > > > > jar > >>>>> >> >> > >> > > > > > > >> >>>>> doesn't > >>>>> >> >> > >> > > > > > > >> >>>>>>>> need > >>>>> >> >> > >> > > > > > > >> >>>>>>>>> to > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> be locally for the job graph > generation. 
The > >>>>> >> >> > >> > > proposal > >>>>> >> >> > >> > > > is > >>>>> >> >> > >> > > > > > > >> >>>>> organized > >>>>> >> >> > >> > > > > > > >> >>>>>>>> as a > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> FLIP > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>> > >>>>> >> >> > >> > > > > > > >> >>> > >>>>> >> >> > >> > > > > > > >> > >>>>> >> >> > >> > > > > > > > >>>>> >> >> > >> > > > > > > >>>>> >> >> > >> > > > > > >>>>> >> >> > >> > > > > >>>>> >> >> > >> > > > >>>>> >> >> > >> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-85+Delayed+JobGraph+Generation > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> . > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> Any questions and suggestions > are welcomed. > >>>>> >> >> > >> Thank > >>>>> >> >> > >> > > you > >>>>> >> >> > >> > > > in > >>>>> >> >> > >> > > > > > > >> >>>>> advance. > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> Best Regards > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> Peter Huang > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>>> > >>>>> >> >> > >> > > > > > > >> >>>>> > >>>>> >> >> > >> > > > > > > >> >>>> > >>>>> >> >> > >> > > > > > > >> >>> > >>>>> >> >> > >> > > > > > > >> >> > >>>>> >> >> > >> > > > > > > >> > >>>>> >> >> > >> > > > > > > >> > >>>>> >> >> > >> > > > > > > > >>>>> >> >> > >> > > > > > > >>>>> >> >> > >> > > > > > >>>>> >> >> > >> > > > > >>>>> >> >> > >> > > > >>>>> >> >> > >> > >>>>> >> >> > > > >>>>> >> >> > |
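
To make the interface being discussed concrete: assuming the proposed `-R/--remote-deploy` option is adopted, and using the jar-location schemes Yang described earlier in the thread (file:// for a jar on the client, hdfs:// for a jar on remote storage, local:// for a jar baked into the image), a submission might look roughly like the sketch below. The flag name, entry class and paths are illustrative only, not a final interface.

    # Today's per-job submission: main() runs on the client, which compiles the
    # JobGraph and therefore needs the jar locally.
    ./bin/flink run -m yarn-cluster -c org.example.MyStreamingJob /path/on/client/my.jar

    # Hypothetical remote-deploy submission: the cluster entrypoint runs main() and
    # compiles the JobGraph, so the jar can stay on remote storage.
    ./bin/flink run -m yarn-cluster -R -c org.example.MyStreamingJob \
        hdfs://myhdfs/user/myname/flink/my.jar

In the second form the client never needs the jar or the JobGraph locally, which matches the goal of moving JobGraph generation into the ClusterEntrypoint.
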
Hi Becket,
Thanks for your attention on FLIP-85! I have answered your questions inline.

> 1. What exactly the job submission interface will look like after this FLIP? The FLIP template has a Public Interface section but was removed from this FLIP.

As Yang mentioned earlier in this thread: from the user's perspective, only a `-R/--remote-deploy` cli option is visible. Users are not aware of the application mode.

> 2. How will the new ClusterEntrypoint fetch the jars from external storage? What external storage will be supported out of the box? Will this "jar fetcher" be pluggable? If so, how does the API look like and how will users specify the custom "jar fetcher"?

It depends. Here are several points:

i. Shipping user files is already handled by Flink, so dependency fetching can also be handled by Flink.
ii. Currently we only support shipping files from the local file system. In Application Mode, to make jar fetching meaningful we should first let users configure richer schemes for the shipped files.
iii. Dependency fetching varies across deployments. On YARN, the convention is to go through HDFS; on Kubernetes, the convention is a configured resource server with the files fetched by an initContainer.

Thus, in the first phase of Application Mode, dependency fetching is handled entirely within Flink.

> 3. It sounds that in this FLIP, the "session cluster" running the application has the same lifecycle as the user application. How will the session cluster be teared down after the application finishes? Will the ClusterEntrypoint do that? Will there be an option of not tearing the cluster down?

The precondition for tearing down the cluster is *both* of the following:

i. the user main() has reached its end, and
ii. all submitted jobs (currently, at most one) have reached a globally terminal state.

As for the "how", that is an implementation topic, but conceptually it is the ClusterEntrypoint's responsibility.

> Will there be an option of not tearing the cluster down?

I think the answer is "No", because the cluster is designed to be bound to an Application. User logic that communicates with the job always lives in its `main()`, and for historical information we have the history server.

Best,
tison.
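As a small illustration of the teardown precondition above (this is not from the FLIP; all names in the snippet are hypothetical stand-ins), the shutdown decision is simply the conjunction of the two conditions:

import java.util.Collection;

// Hypothetical stand-in for a job status; only the "globally terminal"
// distinction matters for the teardown decision.
enum SketchJobStatus {
    RUNNING, FINISHED, CANCELED, FAILED;

    boolean isGloballyTerminal() {
        return this != RUNNING;
    }
}

final class ApplicationTeardownCondition {

    // Tear the cluster down only when the user main() has returned AND every
    // submitted job (currently at most one) has reached a globally terminal state.
    static boolean shouldTearDown(boolean userMainFinished,
                                  Collection<SketchJobStatus> submittedJobs) {
        return userMainFinished
                && submittedJobs.stream().allMatch(SketchJobStatus::isGloballyTerminal);
    }
}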
Hi Becket,
Thanks for jumping in and sharing your concerns. I second tison's answers and will just make some additions.

> job submission interface

This FLIP will introduce an interface for running the user `main()` on the cluster, named `ProgramDeployer`. However, it is not a public interface. It will be used in `CliFrontend` when the remote-deploy option (-R/--remote-deploy) is specified. So the only change on the user side is the new CLI option. (A rough sketch of what such an interface could look like is included after the list below.)

> How to fetch the jars?

Both "local path" and "DFS path" could be supported for fetching the user jars and dependencies. As tison said, we could ship the user jar and dependencies from the client side to HDFS and let the entrypoint fetch them. We also have some other practical ways to use the new "application mode":

1. Upload the user jars and dependencies to a DFS (e.g. HDFS, S3, Aliyun OSS) manually or via some external deployer system. For K8s, the user jars and dependencies could also be built into the Docker image.
2. Specify the remote/local user jar and dependencies in `flink run`. Usually this could also be done by the external deployer system.
3. When the `ClusterEntrypoint` is launched, it will fetch the jars and files automatically. We do not need any specific fetcher implementation, since we can leverage Flink's `FileSystem` abstraction to do this (see the sketch below).
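To make point 3 above concrete, here is a minimal sketch of how an entrypoint could fetch a remote user jar through Flink's `FileSystem` abstraction. Only `FileSystem`, `Path` and `FSDataInputStream` are existing Flink APIs; the class and method names are illustrative, not part of the FLIP:

    import org.apache.flink.core.fs.FSDataInputStream;
    import org.apache.flink.core.fs.FileSystem;
    import org.apache.flink.core.fs.Path;

    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    // Illustrative helper, not part of the FLIP: copies a user jar from any
    // FileSystem scheme Flink knows about (hdfs://, s3://, oss://, file://, ...)
    // into a local working directory before the job graph is generated.
    public final class UserJarFetcher {

        public static java.nio.file.Path fetch(String remoteUri, String localDir) throws Exception {
            Path remote = new Path(remoteUri);
            FileSystem fs = remote.getFileSystem();   // resolves the scheme of the URI
            java.nio.file.Path local = Paths.get(localDir, remote.getName());

            try (FSDataInputStream in = fs.open(remote);
                 OutputStream out = Files.newOutputStream(local)) {
                byte[] buffer = new byte[8192];
                int read;
                while ((read = in.read(buffer)) != -1) {
                    out.write(buffer, 0, read);
                }
            }
            return local;
        }
    }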
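And going back to the `ProgramDeployer` mentioned at the top of this mail: since it is internal, the FLIP does not fix its exact shape, but just to give a feeling, a purely hypothetical sketch (all names and parameters below are my own, not the FLIP's signatures) might be:

    import java.util.List;
    import org.apache.flink.configuration.Configuration;

    // Hypothetical sketch only: an internal interface used by CliFrontend when
    // -R/--remote-deploy is given, deploying a cluster that runs the user main()
    // on the master side instead of compiling the job graph on the client.
    interface ProgramDeployer {
        void deploy(
                Configuration flinkConfiguration,
                String mainClassName,
                List<String> programArguments,
                List<String> userJarAndDependencyUris) throws Exception;
    }

If the option lands as proposed, from the user side it would hypothetically just be something like `./bin/flink run -R -d hdfs:///path/to/my-job.jar`, with everything after that happening in the entrypoint.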
Best,
Yang

tison <[hidden email]> 于2020年3月9日周一 上午11:34写道:

> Hi Becket,
>
> Thanks for your attention to FLIP-85! I have answered your questions inline.
>
> 1. What exactly the job submission interface will look like after this FLIP? The FLIP template has a Public Interface section but was removed from this FLIP.
>
> As Yang mentioned in this thread above:
>
> > From user perspective, only a `-R/--remote-deploy` cli option is visible. They are not aware of the application mode.
>
> 2. How will the new ClusterEntrypoint fetch the jars from external storage? What external storage will be supported out of the box? Will this "jar fetcher" be pluggable? If so, how does the API look like and how will users specify the custom "jar fetcher"?
>
> It depends, actually. Here are several points:
>
> i. Currently, shipping user files is handled by Flink, so dependency fetching can be handled by Flink.
> ii. Currently, we only support shipping files from the local file system. In application mode, to support meaningful jar fetching we should first support richer schemes for ship files.
> iii. Dependency fetching varies across deployments. That is, on YARN the convention is to go through HDFS; on Kubernetes the convention is a configured resource server with fetching done by an init container.
>
> Thus, in the first phase of the application mode, dependency fetching is handled entirely within Flink.
>
> 3. It sounds that in this FLIP, the "session cluster" running the application has the same lifecycle as the user application. How will the session cluster be teared down after the application finishes? Will the ClusterEntrypoint do that? Will there be an option of not tearing the cluster down?
>
> The precondition for tearing down the cluster is that *both*
>
> i. the user `main()` has run to completion, and
> ii. all submitted jobs (currently, at most one) have reached a globally terminal state.
>
> As for the "how", it is an implementation topic, but conceptually it is the ClusterEntrypoint's responsibility.
>
> > Will there be an option of not tearing the cluster down?
>
> I think the answer is "no", because the cluster is designed to be bound to an application. User logic that communicates with the job always lives in its `main()`, and for historical information we have the history server.
>
> Best,
> tison.
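To spell out the teardown precondition tison describes above, a tiny hypothetical check (not actual Flink code; Flink's `JobStatus` enum is assumed, and its package location differs across Flink versions) could read:

    import java.util.Collection;
    import org.apache.flink.api.common.JobStatus;

    // Hypothetical helper mirroring the precondition above: tear the cluster
    // down only when the user main() has returned AND every submitted job
    // (currently at most one) is in a globally terminal state.
    final class TeardownCondition {
        static boolean canTearDown(boolean userMainFinished, Collection<JobStatus> submittedJobStatuses) {
            return userMainFinished
                    && submittedJobStatuses.stream().allMatch(JobStatus::isGloballyTerminalState);
        }
    }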
>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> >>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> >>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> Best Regards >>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> Peter Huang >>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> >>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> >>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>> >>> >>>>> >> >> > >> > > > > > > >> >>>>>>>> >>> >>>>> >> >> > >> > > > > > > >> >>>>>>> >>> >>>>> >> >> > >> > > > > > > >> >>>>>> >>> >>>>> >> >> > >> > > > > > > >> >>>>> >>> >>>>> >> >> > >> > > > > > > >> >>>> >>> >>>>> >> >> > >> > > > > > > >> >>> >>> >>>>> >> >> > >> > > > > > > >> >> >>> >>>>> >> >> > >> > > > > > > >> >>> >>>>> >> >> > >> > > > > > > >> >>> >>>>> >> >> > >> > > > > > > >>> >>>>> >> >> > >> > > > > > >>> >>>>> >> >> > >> > > > > >>> >>>>> >> >> > >> > > > >>> >>>>> >> >> > >> > > >>> >>>>> >> >> > >> >>> >>>>> >> >> > > >>> >>>>> >> >> >>> >> |
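To make Yang's two suggestions above a bit more concrete (a unified JobGraphRetriever, and starting a per-job cluster from a remote user jar), here is a minimal sketch of what such a retriever could look like on the entrypoint side. Only the FileSystem/Path classes and the retrieveJobGraph(Configuration) shape follow existing Flink code as I understand it; the class name, the config key, and compileJobGraph(...) are hypothetical placeholders, not anything the FLIP actually defines.

// Hedged sketch only; names marked "hypothetical" are not existing Flink classes.
import java.io.File;
import java.net.URI;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.core.fs.FSDataInputStream;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;
import org.apache.flink.runtime.jobgraph.JobGraph;
import org.apache.flink.util.FlinkException;

/** Hypothetical unified retriever covering both the "program metadata" and "classpath" cases. */
public class RemoteUserJarJobGraphRetriever /* implements JobGraphRetriever */ {

    // Hypothetical config key: where the user jar lives (file://, hdfs://, s3://, ...).
    private static final String USER_JAR_URI = "$internal.application.user-jar";

    public JobGraph retrieveJobGraph(Configuration configuration) throws FlinkException {
        try {
            URI jarUri = URI.create(configuration.getString(USER_JAR_URI, null));
            File localJar;
            if (jarUri.getScheme() == null || "file".equals(jarUri.getScheme())) {
                // Jar is already local to the entrypoint (e.g. baked into the image).
                localJar = new File(jarUri.getPath());
            } else {
                // Fetch the jar through Flink's FileSystem abstraction (HDFS, S3, OSS, ...).
                localJar = File.createTempFile("user-job", ".jar");
                FileSystem fs = FileSystem.get(jarUri);
                try (FSDataInputStream in = fs.open(new Path(jarUri))) {
                    java.nio.file.Files.copy(
                            in, localJar.toPath(), java.nio.file.StandardCopyOption.REPLACE_EXISTING);
                }
            }
            // Placeholder: compile the user program in the entrypoint (e.g. via a PackagedProgram).
            return compileJobGraph(localJar, configuration);
        } catch (Exception e) {
            throw new FlinkException("Could not retrieve the JobGraph on the cluster side.", e);
        }
    }

    private JobGraph compileJobGraph(File userJar, Configuration configuration) {
        throw new UnsupportedOperationException("illustration only");
    }
}

The point of the sketch is simply that the entrypoint never assumes the jar is local: anything with a non-file scheme is pulled through Flink's FileSystem abstraction first, so HDFS, S3 or OSS paths would all be handled the same way.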
Thanks for the reply, tison and Yang,
Regarding the public interface, is "-R/--remote" option the only change? Will the users also need to provide a remote location to upload and store the jars, and a list of jars as dependencies to be uploaded? It would be important that the public interface section in the FLIP includes all the user sensible changes including the CLI / configuration / metrics, etc. Can we update the FLIP to include the conclusion we have here in the ML? Thanks, Jiangjie (Becket) Qin On Mon, Mar 9, 2020 at 11:59 AM Yang Wang <[hidden email]> wrote: > Hi Becket, > > Thanks for jumping out and sharing your concerns. I second tison's answer > and just > make some additions. > > > > job submission interface > > This FLIP will introduce an interface for running user `main()` on > cluster, named as > “ProgramDeployer”. However, it is not a public interface. It will be used > in `CliFrontend` > when the remote deploy option(-R/--remote-deploy) is specified. So the > only changes > on user side is about the cli option. > > > > How to fetch the jars? > > The “local path” and “dfs path“ could be supported to fetch the user jars > and dependencies. > Just like tison has said, we could ship the user jar and dependencies from > client side to > HDFS and use the entrypoint to fetch. > > Also we have some other practical ways to use the new “application mode“. > 1. Upload the user jars and dependencies to the DFS(e.g. HDFS, S3, Aliyun > OSS) manually > or some external deployer system. For K8s, the user jars and dependencies > could also be > built in the docker image. > 2. Specify the remote/local user jar and dependencies in `flink run`. > Usually this could also > be done by the external deployer system. > 3. When the `ClusterEntrypoint` is launched, it will fetch the jars and > files automatically. We > do not need any specific fetcher implementation. Since we could leverage > flink `FileSystem` > to do this. > > > > > > Best, > Yang > > tison <[hidden email]> 于2020年3月9日周一 上午11:34写道: > >> Hi Becket, >> >> Thanks for your attention on FLIP-85! I answered your question inline. >> >> 1. What exactly the job submission interface will look like after this >> FLIP? The FLIP template has a Public Interface section but was removed from >> this FLIP. >> >> As Yang mentioned in this thread above: >> >> From user perspective, only a `-R/-- remote-deploy` cli option is >> visible. They are not aware of the application mode. >> >> 2. How will the new ClusterEntrypoint fetch the jars from external >> storage? What external storage will be supported out of the box? Will this >> "jar fetcher" be pluggable? If so, how does the API look like and how will >> users specify the custom "jar fetcher"? >> >> It depends actually. Here are several points: >> >> i. Currently, shipping user files is handled by Flink, dependencies >> fetching can be handled by Flink. >> ii. Current, we only support local file system shipfiles. When in >> Application Mode, to support meaningful jar fetch we should support user to >> configure richer shipfiles schema at first. >> iii. Dependencies fetching varies from deployments. That is, on YARN, its >> convention is through HDFS; on Kubernetes, its convention is configured >> resource server and fetched by initContainer. >> >> Thus, in the First phase of Application Mode dependencies fetching is >> totally handled within Flink. >> >> 3. It sounds that in this FLIP, the "session cluster" running the >> application has the same lifecycle as the user application. 
How will the >> session cluster be teared down after the application finishes? Will the >> ClusterEntrypoint do that? Will there be an option of not tearing the >> cluster down? >> >> The precondition we tear down the cluster is *both* >> >> i. user main reached to its end >> ii. all jobs submitted(current, at most one) reached global terminate >> state >> >> For the "how", it is an implementation topic, but conceptually it is >> ClusterEntrypoint's responsibility. >> >> >Will there be an option of not tearing the cluster down? >> >> I think the answer is "No" because the cluster is designed to be bounded >> with an Application. User logic that communicates with the job is always in >> its `main`, and for history information we have history server. >> >> Best, >> tison. >> >> >> Becket Qin <[hidden email]> 于2020年3月9日周一 上午8:12写道: >> >>> Hi Peter and Kostas, >>> >>> Thanks for creating this FLIP. Moving the JobGraph compilation to the >>> cluster makes a lot of sense to me. FLIP-40 had the exactly same idea, but >>> is currently dormant and can probably be superseded by this FLIP. After >>> reading the FLIP, I still have a few questions. >>> >>> 1. What exactly the job submission interface will look like after this >>> FLIP? The FLIP template has a Public Interface section but was removed from >>> this FLIP. >>> 2. How will the new ClusterEntrypoint fetch the jars from external >>> storage? What external storage will be supported out of the box? Will this >>> "jar fetcher" be pluggable? If so, how does the API look like and how will >>> users specify the custom "jar fetcher"? >>> 3. It sounds that in this FLIP, the "session cluster" running the >>> application has the same lifecycle as the user application. How will the >>> session cluster be teared down after the application finishes? Will the >>> ClusterEntrypoint do that? Will there be an option of not tearing the >>> cluster down? >>> >>> Maybe they have been discussed in the ML earlier, but I think they >>> should be part of the FLIP also. >>> >>> Thanks, >>> >>> Jiangjie (Becket) Qin >>> >>> On Thu, Mar 5, 2020 at 10:09 PM Kostas Kloudas <[hidden email]> >>> wrote: >>> >>>> Also from my side +1 to start voting. >>>> >>>> Cheers, >>>> Kostas >>>> >>>> On Thu, Mar 5, 2020 at 7:45 AM tison <[hidden email]> wrote: >>>> > >>>> > +1 to star voting. >>>> > >>>> > Best, >>>> > tison. >>>> > >>>> > >>>> > Yang Wang <[hidden email]> 于2020年3月5日周四 下午2:29写道: >>>> >> >>>> >> Hi Peter, >>>> >> Really thanks for your response. >>>> >> >>>> >> Hi all @Kostas Kloudas @Zili Chen @Peter Huang @Rong Rong >>>> >> It seems that we have reached an agreement. The “application mode” >>>> is regarded as the enhanced “per-job”. It is >>>> >> orthogonal with “cluster deploy”. Currently, we bind the “per-job” >>>> to `run-user-main-on-client` and “application mode” >>>> >> to `run-user-main-on-cluster`. >>>> >> >>>> >> Do you have other concerns to moving FLIP-85 to voting? >>>> >> >>>> >> >>>> >> Best, >>>> >> Yang >>>> >> >>>> >> Peter Huang <[hidden email]> 于2020年3月5日周四 下午12:48写道: >>>> >>> >>>> >>> Hi Yang and Kostas, >>>> >>> >>>> >>> Thanks for the clarification. It makes more sense to me if the long >>>> term goal is to replace per job mode to application mode >>>> >>> in the future (at the time that multiple execute can be >>>> supported). Before that, It will be better to keep the concept of >>>> >>> application mode internally. 
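Since the only user-visible change is the CLI option (mentioned above and again just below), a hedged sketch of how the run path could branch on -R/--remote-deploy may help. ProgramDeployer is the internal interface Yang names, but its method and the two config keys below are assumptions made up for illustration, not the FLIP's actual signatures.

// Hedged sketch: how the run path might branch on -R/--remote-deploy.
// ProgramDeployer's method and the config keys below are hypothetical.
import java.util.List;

import org.apache.flink.configuration.Configuration;

interface ProgramDeployer {
    // Deploy a cluster that runs the user's main() on the master side.
    void deployApplicationCluster(Configuration clusterConfig) throws Exception;
}

class RunCommandSketch {

    void run(boolean remoteDeploy, String userJar, List<String> programArgs,
             Configuration configuration, ProgramDeployer deployer) throws Exception {
        if (remoteDeploy) {
            // Cluster-side execution: ship only the program metadata, never compile locally.
            configuration.setString("$internal.application.jar", userJar);                          // hypothetical key
            configuration.setString("$internal.application.args", String.join(";", programArgs));   // hypothetical key
            deployer.deployApplicationCluster(configuration);
        } else {
            // Existing per-job behaviour: run the user's main() here on the client.
            runUserMainOnClient(userJar, programArgs, configuration);
        }
    }

    void runUserMainOnClient(String userJar, List<String> args, Configuration conf) {
        // unchanged client-side path (compile the JobGraph locally and submit), elided here
    }
}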
As Yang suggested, User only need to >>>> use a `-R/-- remote-deploy` cli option to launch >>>> >>> a per job cluster with the main function executed in cluster >>>> entry-point. +1 for the execution plan. >>>> >>> >>>> >>> >>>> >>> >>>> >>> Best Regards >>>> >>> Peter Huang >>>> >>> >>>> >>> >>>> >>> >>>> >>> >>>> >>> On Tue, Mar 3, 2020 at 7:11 AM Yang Wang <[hidden email]> >>>> wrote: >>>> >>>> >>>> >>>> Hi Peter, >>>> >>>> >>>> >>>> Having the application mode does not mean we will drop the >>>> cluster-deploy >>>> >>>> option. I just want to share some thoughts about “Application >>>> Mode”. >>>> >>>> >>>> >>>> >>>> >>>> 1. The application mode could cover the per-job sematic. Its >>>> lifecyle is bound >>>> >>>> to the user `main()`. And all the jobs in the user main will be >>>> executed in a same >>>> >>>> Flink cluster. In first phase of FLIP-85 implementation, running >>>> user main on the >>>> >>>> cluster side could be supported in application mode. >>>> >>>> >>>> >>>> 2. Maybe in the future, we also need to support multiple >>>> `execute()` on client side >>>> >>>> in a same Flink cluster. Then the per-job mode will evolve to >>>> application mode. >>>> >>>> >>>> >>>> 3. From user perspective, only a `-R/-- remote-deploy` cli option >>>> is visible. They >>>> >>>> are not aware of the application mode. >>>> >>>> >>>> >>>> 4. In the first phase, the application mode is working as >>>> “per-job”(only one job in >>>> >>>> the user main). We just leave more potential for the future. >>>> >>>> >>>> >>>> >>>> >>>> I am not against with calling it “cluster deploy mode” if you all >>>> think it is clearer for users. >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> Best, >>>> >>>> Yang >>>> >>>> >>>> >>>> Kostas Kloudas <[hidden email]> 于2020年3月3日周二 下午6:49写道: >>>> >>>>> >>>> >>>>> Hi Peter, >>>> >>>>> >>>> >>>>> I understand your point. This is why I was also a bit torn about >>>> the >>>> >>>>> name and my proposal was a bit aligned with yours (something >>>> along the >>>> >>>>> lines of "cluster deploy" mode). >>>> >>>>> >>>> >>>>> But many of the other participants in the discussion suggested the >>>> >>>>> "Application Mode". I think that the reasoning is that now the >>>> user's >>>> >>>>> Application is more self-contained. >>>> >>>>> It will be submitted to the cluster and the user can just >>>> disconnect. >>>> >>>>> In addition, as discussed briefly in the doc, in the future there >>>> may >>>> >>>>> be better support for multi-execute applications which will bring >>>> us >>>> >>>>> one step closer to the true "Application Mode". But this is how I >>>> >>>>> interpreted their arguments, of course they can also express their >>>> >>>>> thoughts on the topic :) >>>> >>>>> >>>> >>>>> Cheers, >>>> >>>>> Kostas >>>> >>>>> >>>> >>>>> On Mon, Mar 2, 2020 at 6:15 PM Peter Huang < >>>> [hidden email]> wrote: >>>> >>>>> > >>>> >>>>> > Hi Kostas, >>>> >>>>> > >>>> >>>>> > Thanks for updating the wiki. We have aligned with the >>>> implementations in the doc. But I feel it is still a little bit confusing >>>> of the naming from a user's perspective. It is well known that Flink >>>> support per job cluster and session cluster. The concept is in the layer of >>>> how a job is managed within Flink. The method introduced util now is a kind >>>> of mixing job and session cluster to promising the implementation >>>> complexity. We probably don't need to label it as Application Model as the >>>> same layer of per job cluster and session cluster. 
Conceptually, I think it >>>> is still a cluster mode implementation for per job cluster. >>>> >>>>> > >>>> >>>>> > To minimize the confusion of users, I think it would be better >>>> just an option of per job cluster for each type of cluster manager. How do >>>> you think? >>>> >>>>> > >>>> >>>>> > >>>> >>>>> > Best Regards >>>> >>>>> > Peter Huang >>>> >>>>> > >>>> >>>>> > >>>> >>>>> > >>>> >>>>> > >>>> >>>>> > >>>> >>>>> > >>>> >>>>> > >>>> >>>>> > >>>> >>>>> > On Mon, Mar 2, 2020 at 7:22 AM Kostas Kloudas < >>>> [hidden email]> wrote: >>>> >>>>> >> >>>> >>>>> >> Hi Yang, >>>> >>>>> >> >>>> >>>>> >> The difference between per-job and application mode is that, >>>> as you >>>> >>>>> >> described, in the per-job mode the main is executed on the >>>> client >>>> >>>>> >> while in the application mode, the main is executed on the >>>> cluster. >>>> >>>>> >> I do not think we have to offer "application mode" with >>>> running the >>>> >>>>> >> main on the client side as this is exactly what the per-job >>>> mode does >>>> >>>>> >> currently and, as you described also, it would be redundant. >>>> >>>>> >> >>>> >>>>> >> Sorry if this was not clear in the document. >>>> >>>>> >> >>>> >>>>> >> Cheers, >>>> >>>>> >> Kostas >>>> >>>>> >> >>>> >>>>> >> On Mon, Mar 2, 2020 at 3:17 PM Yang Wang < >>>> [hidden email]> wrote: >>>> >>>>> >> > >>>> >>>>> >> > Hi Kostas, >>>> >>>>> >> > >>>> >>>>> >> > Thanks a lot for your conclusion and updating the FLIP-85 >>>> WIKI. Currently, i have no more >>>> >>>>> >> > questions about motivation, approach, fault tolerance and >>>> the first phase implementation. >>>> >>>>> >> > >>>> >>>>> >> > I think the new title "Flink Application Mode" makes a lot >>>> senses to me. Especially for the >>>> >>>>> >> > containerized environment, the cluster deploy option will be >>>> very useful. >>>> >>>>> >> > >>>> >>>>> >> > Just one concern, how do we introduce this new application >>>> mode to our users? >>>> >>>>> >> > Each user program(i.e. `main()`) is an application. >>>> Currently, we intend to only support one >>>> >>>>> >> > `execute()`. So what's the difference between per-job and >>>> application mode? >>>> >>>>> >> > >>>> >>>>> >> > For per-job, user `main()` is always executed on client >>>> side. And For application mode, user >>>> >>>>> >> > `main()` could be executed on client or master >>>> side(configured via cli option). >>>> >>>>> >> > Right? We need to have a clear concept. Otherwise, the users >>>> will be more and more confusing. >>>> >>>>> >> > >>>> >>>>> >> > >>>> >>>>> >> > Best, >>>> >>>>> >> > Yang >>>> >>>>> >> > >>>> >>>>> >> > Kostas Kloudas <[hidden email]> 于2020年3月2日周一 下午5:58写道: >>>> >>>>> >> >> >>>> >>>>> >> >> Hi all, >>>> >>>>> >> >> >>>> >>>>> >> >> I update >>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-85+Flink+Application+Mode >>>> >>>>> >> >> based on the discussion we had here: >>>> >>>>> >> >> >>>> >>>>> >> >> >>>> https://docs.google.com/document/d/1ji72s3FD9DYUyGuKnJoO4ApzV-nSsZa0-bceGXW7Ocw/edit# >>>> >>>>> >> >> >>>> >>>>> >> >> Please let me know what you think and please keep the >>>> discussion in the ML :) >>>> >>>>> >> >> >>>> >>>>> >> >> Thanks for starting the discussion and I hope that soon we >>>> will be >>>> >>>>> >> >> able to vote on the FLIP. 
>>>> >>>>> >> >> >>>> >>>>> >> >> Cheers, >>>> >>>>> >> >> Kostas >>>> >>>>> >> >> >>>> >>>>> >> >> On Thu, Jan 16, 2020 at 3:40 AM Yang Wang < >>>> [hidden email]> wrote: >>>> >>>>> >> >> > >>>> >>>>> >> >> > Hi all, >>>> >>>>> >> >> > >>>> >>>>> >> >> > Thanks a lot for the feedback from @Kostas Kloudas. Your >>>> all concerns are >>>> >>>>> >> >> > on point. The FLIP-85 is mainly >>>> >>>>> >> >> > focused on supporting cluster mode for per-job. Since it >>>> is more urgent and >>>> >>>>> >> >> > have much more use >>>> >>>>> >> >> > cases both in Yarn and Kubernetes deployment. For session >>>> cluster, we could >>>> >>>>> >> >> > have more discussion >>>> >>>>> >> >> > in a new thread later. >>>> >>>>> >> >> > >>>> >>>>> >> >> > #1, How to download the user jars and dependencies for >>>> per-job in cluster >>>> >>>>> >> >> > mode? >>>> >>>>> >> >> > For Yarn, we could register the user jars and >>>> dependencies as >>>> >>>>> >> >> > LocalResource. They will be distributed >>>> >>>>> >> >> > by Yarn. And once the JobManager and TaskManager >>>> launched, the jars are >>>> >>>>> >> >> > already exists. >>>> >>>>> >> >> > For Standalone per-job and K8s, we expect that the user >>>> jars >>>> >>>>> >> >> > and dependencies are built into the image. >>>> >>>>> >> >> > Or the InitContainer could be used for downloading. It is >>>> natively >>>> >>>>> >> >> > distributed and we will not have bottleneck. >>>> >>>>> >> >> > >>>> >>>>> >> >> > #2, Job graph recovery >>>> >>>>> >> >> > We could have an optimization to store job graph on the >>>> DFS. However, i >>>> >>>>> >> >> > suggest building a new jobgraph >>>> >>>>> >> >> > from the configuration is the default option. Since we >>>> will not always have >>>> >>>>> >> >> > a DFS store when deploying a >>>> >>>>> >> >> > Flink per-job cluster. Of course, we assume that using >>>> the same >>>> >>>>> >> >> > configuration(e.g. job_id, user_jar, main_class, >>>> >>>>> >> >> > main_args, parallelism, savepoint_settings, etc.) will >>>> get a same job >>>> >>>>> >> >> > graph. I think the standalone per-job >>>> >>>>> >> >> > already has the similar behavior. >>>> >>>>> >> >> > >>>> >>>>> >> >> > #3, What happens with jobs that have multiple execute >>>> calls? >>>> >>>>> >> >> > Currently, it is really a problem. Even we use a local >>>> client on Flink >>>> >>>>> >> >> > master side, it will have different behavior with >>>> >>>>> >> >> > client mode. For client mode, if we execute multiple >>>> times, then we will >>>> >>>>> >> >> > deploy multiple Flink clusters for each execute. >>>> >>>>> >> >> > I am not pretty sure whether it is reasonable. However, i >>>> still think using >>>> >>>>> >> >> > the local client is a good choice. We could >>>> >>>>> >> >> > continue the discussion in a new thread. @Zili Chen < >>>> [hidden email]> Do >>>> >>>>> >> >> > you want to drive this? >>>> >>>>> >> >> > >>>> >>>>> >> >> > >>>> >>>>> >> >> > >>>> >>>>> >> >> > Best, >>>> >>>>> >> >> > Yang >>>> >>>>> >> >> > >>>> >>>>> >> >> > Peter Huang <[hidden email]> 于2020年1月16日周四 >>>> 上午1:55写道: >>>> >>>>> >> >> > >>>> >>>>> >> >> > > Hi Kostas, >>>> >>>>> >> >> > > >>>> >>>>> >> >> > > Thanks for this feedback. I can't agree more about the >>>> opinion. The >>>> >>>>> >> >> > > cluster mode should be added >>>> >>>>> >> >> > > first in per job cluster. >>>> >>>>> >> >> > > >>>> >>>>> >> >> > > 1) For job cluster implementation >>>> >>>>> >> >> > > 1. 
Job graph recovery from configuration or store as >>>> static job graph as >>>> >>>>> >> >> > > session cluster. I think the static one will be better >>>> for less recovery >>>> >>>>> >> >> > > time. >>>> >>>>> >> >> > > Let me update the doc for details. >>>> >>>>> >> >> > > >>>> >>>>> >> >> > > 2. For job execute multiple times, I think @Zili Chen >>>> >>>>> >> >> > > <[hidden email]> has proposed the local client >>>> solution that can >>>> >>>>> >> >> > > the run program actually in the cluster entry point. We >>>> can put the >>>> >>>>> >> >> > > implementation in the second stage, >>>> >>>>> >> >> > > or even a new FLIP for further discussion. >>>> >>>>> >> >> > > >>>> >>>>> >> >> > > 2) For session cluster implementation >>>> >>>>> >> >> > > We can disable the cluster mode for the session cluster >>>> in the first >>>> >>>>> >> >> > > stage. I agree the jar downloading will be a painful >>>> thing. >>>> >>>>> >> >> > > We can consider about PoC and performance evaluation >>>> first. If the end to >>>> >>>>> >> >> > > end experience is good enough, then we can consider >>>> >>>>> >> >> > > proceeding with the solution. >>>> >>>>> >> >> > > >>>> >>>>> >> >> > > Looking forward to more opinions from @Yang Wang < >>>> [hidden email]> @Zili >>>> >>>>> >> >> > > Chen <[hidden email]> @Dian Fu < >>>> [hidden email]>. >>>> >>>>> >> >> > > >>>> >>>>> >> >> > > >>>> >>>>> >> >> > > Best Regards >>>> >>>>> >> >> > > Peter Huang >>>> >>>>> >> >> > > >>>> >>>>> >> >> > > On Wed, Jan 15, 2020 at 7:50 AM Kostas Kloudas < >>>> [hidden email]> wrote: >>>> >>>>> >> >> > > >>>> >>>>> >> >> > >> Hi all, >>>> >>>>> >> >> > >> >>>> >>>>> >> >> > >> I am writing here as the discussion on the Google Doc >>>> seems to be a >>>> >>>>> >> >> > >> bit difficult to follow. >>>> >>>>> >> >> > >> >>>> >>>>> >> >> > >> I think that in order to be able to make progress, it >>>> would be helpful >>>> >>>>> >> >> > >> to focus on per-job mode for now. >>>> >>>>> >> >> > >> The reason is that: >>>> >>>>> >> >> > >> 1) making the (unique) JobSubmitHandler responsible >>>> for creating the >>>> >>>>> >> >> > >> jobgraphs, >>>> >>>>> >> >> > >> which includes downloading dependencies, is not an >>>> optimal solution >>>> >>>>> >> >> > >> 2) even if we put the responsibility on the >>>> JobMaster, currently each >>>> >>>>> >> >> > >> job has its own >>>> >>>>> >> >> > >> JobMaster but they all run on the same process, so >>>> we have again a >>>> >>>>> >> >> > >> single entity. >>>> >>>>> >> >> > >> >>>> >>>>> >> >> > >> Of course after this is done, and if we feel >>>> comfortable with the >>>> >>>>> >> >> > >> solution, then we can go to the session mode. >>>> >>>>> >> >> > >> >>>> >>>>> >> >> > >> A second comment has to do with fault-tolerance in the >>>> per-job, >>>> >>>>> >> >> > >> cluster-deploy mode. >>>> >>>>> >> >> > >> In the document, it is suggested that upon recovery, >>>> the JobMaster of >>>> >>>>> >> >> > >> each job re-creates the JobGraph. >>>> >>>>> >> >> > >> I am just wondering if it is better to create and >>>> store the jobGraph >>>> >>>>> >> >> > >> upon submission and only fetch it >>>> >>>>> >> >> > >> upon recovery so that we have a static jobGraph. >>>> >>>>> >> >> > >> >>>> >>>>> >> >> > >> Finally, I have a question which is what happens with >>>> jobs that have >>>> >>>>> >> >> > >> multiple execute calls? >>>> >>>>> >> >> > >> The semantics seem to change compared to the current >>>> behaviour, right? 
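On the recovery question above, a minimal sketch contrasting the two options (store the compiled JobGraph at submission time vs. deterministically rebuild it from the configuration); the PersistedGraphStore interface, the config key, and the fixed-JobID convention are assumptions for illustration, not decided behaviour.

// Hedged sketch of the two recovery options discussed above; all names are hypothetical.
import java.util.Optional;

import org.apache.flink.api.common.JobID;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.jobgraph.JobGraph;

class JobGraphRecoverySketch {

    /** Option A (Kostas' suggestion): persist the compiled graph once at submission, fetch it on failover. */
    JobGraph recoverFromStore(PersistedGraphStore store, JobID jobId, Configuration conf) throws Exception {
        Optional<JobGraph> stored = store.get(jobId);
        if (stored.isPresent()) {
            return stored.get();          // static graph, identical across failovers
        }
        JobGraph graph = compileFromConfiguration(conf, jobId);
        store.put(jobId, graph);          // first start: compile once, then persist
        return graph;
    }

    /** Option B (Yang's default): always rebuild from the same configuration; relies on determinism. */
    JobGraph recoverByRecompiling(Configuration conf) throws Exception {
        // Same user jar, main class, args, parallelism and a fixed JobID must yield the same graph.
        JobID fixedJobId =
                JobID.fromHexString(conf.getString("$internal.pipeline.job-id", new JobID().toHexString()));
        return compileFromConfiguration(conf, fixedJobId);
    }

    interface PersistedGraphStore {       // hypothetical persistence, e.g. backed by DFS or the HA store
        Optional<JobGraph> get(JobID jobId) throws Exception;
        void put(JobID jobId, JobGraph graph) throws Exception;
    }

    private JobGraph compileFromConfiguration(Configuration conf, JobID jobId) {
        throw new UnsupportedOperationException("illustration only");
    }
}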
>>>> >>>>> >> >> > >> >>>> >>>>> >> >> > >> Cheers, >>>> >>>>> >> >> > >> Kostas >>>> >>>>> >> >> > >> >>>> >>>>> >> >> > >> On Wed, Jan 8, 2020 at 8:05 PM tison < >>>> [hidden email]> wrote: >>>> >>>>> >> >> > >> > >>>> >>>>> >> >> > >> > not always, Yang Wang is also not yet a committer >>>> but he can join the >>>> >>>>> >> >> > >> > channel. I cannot find the id by clicking “Add new >>>> member in channel” so >>>> >>>>> >> >> > >> > come to you and ask for try out the link. Possibly I >>>> will find other >>>> >>>>> >> >> > >> ways >>>> >>>>> >> >> > >> > but the original purpose is that the slack channel >>>> is a public area we >>>> >>>>> >> >> > >> > discuss about developing... >>>> >>>>> >> >> > >> > Best, >>>> >>>>> >> >> > >> > tison. >>>> >>>>> >> >> > >> > >>>> >>>>> >> >> > >> > >>>> >>>>> >> >> > >> > Peter Huang <[hidden email]> >>>> 于2020年1月9日周四 上午2:44写道: >>>> >>>>> >> >> > >> > >>>> >>>>> >> >> > >> > > Hi Tison, >>>> >>>>> >> >> > >> > > >>>> >>>>> >> >> > >> > > I am not the committer of Flink yet. I think I >>>> can't join it also. >>>> >>>>> >> >> > >> > > >>>> >>>>> >> >> > >> > > >>>> >>>>> >> >> > >> > > Best Regards >>>> >>>>> >> >> > >> > > Peter Huang >>>> >>>>> >> >> > >> > > >>>> >>>>> >> >> > >> > > On Wed, Jan 8, 2020 at 9:39 AM tison < >>>> [hidden email]> wrote: >>>> >>>>> >> >> > >> > > >>>> >>>>> >> >> > >> > > > Hi Peter, >>>> >>>>> >> >> > >> > > > >>>> >>>>> >> >> > >> > > > Could you try out this link? >>>> >>>>> >> >> > >> > > https://the-asf.slack.com/messages/CNA3ADZPH >>>> >>>>> >> >> > >> > > > >>>> >>>>> >> >> > >> > > > Best, >>>> >>>>> >> >> > >> > > > tison. >>>> >>>>> >> >> > >> > > > >>>> >>>>> >> >> > >> > > > >>>> >>>>> >> >> > >> > > > Peter Huang <[hidden email]> >>>> 于2020年1月9日周四 上午1:22写道: >>>> >>>>> >> >> > >> > > > >>>> >>>>> >> >> > >> > > > > Hi Tison, >>>> >>>>> >> >> > >> > > > > >>>> >>>>> >> >> > >> > > > > I can't join the group with shared link. Would >>>> you please add me >>>> >>>>> >> >> > >> into >>>> >>>>> >> >> > >> > > the >>>> >>>>> >> >> > >> > > > > group? My slack account is huangzhenqiu0825. >>>> >>>>> >> >> > >> > > > > Thank you in advance. >>>> >>>>> >> >> > >> > > > > >>>> >>>>> >> >> > >> > > > > >>>> >>>>> >> >> > >> > > > > Best Regards >>>> >>>>> >> >> > >> > > > > Peter Huang >>>> >>>>> >> >> > >> > > > > >>>> >>>>> >> >> > >> > > > > On Wed, Jan 8, 2020 at 12:02 AM tison < >>>> [hidden email]> >>>> >>>>> >> >> > >> wrote: >>>> >>>>> >> >> > >> > > > > >>>> >>>>> >> >> > >> > > > > > Hi Peter, >>>> >>>>> >> >> > >> > > > > > >>>> >>>>> >> >> > >> > > > > > As described above, this effort should get >>>> attention from people >>>> >>>>> >> >> > >> > > > > developing >>>> >>>>> >> >> > >> > > > > > FLIP-73 a.k.a. Executor abstractions. I >>>> recommend you to join >>>> >>>>> >> >> > >> the >>>> >>>>> >> >> > >> > > > public >>>> >>>>> >> >> > >> > > > > > slack channel[1] for Flink Client API >>>> Enhancement and you can >>>> >>>>> >> >> > >> try to >>>> >>>>> >> >> > >> > > > > share >>>> >>>>> >> >> > >> > > > > > you detailed thoughts there. It possibly >>>> gets more concrete >>>> >>>>> >> >> > >> > > attentions. >>>> >>>>> >> >> > >> > > > > > >>>> >>>>> >> >> > >> > > > > > Best, >>>> >>>>> >> >> > >> > > > > > tison. 
>>>> >>>>> >> >> > >> > > > > > >>>> >>>>> >> >> > >> > > > > > [1] >>>> >>>>> >> >> > >> > > > > > >>>> >>>>> >> >> > >> > > > > > >>>> >>>>> >> >> > >> > > > > >>>> >>>>> >> >> > >> > > > >>>> >>>>> >> >> > >> > > >>>> >>>>> >> >> > >> >>>> https://slack.com/share/IS21SJ75H/Rk8HhUly9FuEHb7oGwBZ33uL/enQtODg2MDYwNjE5MTg3LTA2MjIzNDc1M2ZjZDVlMjdlZjk1M2RkYmJhNjAwMTk2ZDZkODQ4NmY5YmI4OGRhNWJkYTViMTM1NzlmMzc4OWM >>>> >>>>> >> >> > >> > > > > > >>>> >>>>> >> >> > >> > > > > > >>>> >>>>> >> >> > >> > > > > > Peter Huang <[hidden email]> >>>> 于2020年1月7日周二 上午5:09写道: >>>> >>>>> >> >> > >> > > > > > >>>> >>>>> >> >> > >> > > > > > > Dear All, >>>> >>>>> >> >> > >> > > > > > > >>>> >>>>> >> >> > >> > > > > > > Happy new year! According to existing >>>> feedback from the >>>> >>>>> >> >> > >> community, >>>> >>>>> >> >> > >> > > we >>>> >>>>> >> >> > >> > > > > > > revised the doc with the consideration of >>>> session cluster >>>> >>>>> >> >> > >> support, >>>> >>>>> >> >> > >> > > > and >>>> >>>>> >> >> > >> > > > > > > concrete interface changes needed and >>>> execution plan. Please >>>> >>>>> >> >> > >> take >>>> >>>>> >> >> > >> > > one >>>> >>>>> >> >> > >> > > > > > more >>>> >>>>> >> >> > >> > > > > > > round of review at your most convenient >>>> time. >>>> >>>>> >> >> > >> > > > > > > >>>> >>>>> >> >> > >> > > > > > > >>>> >>>>> >> >> > >> > > > > > > >>>> >>>>> >> >> > >> > > > > > >>>> >>>>> >> >> > >> > > > > >>>> >>>>> >> >> > >> > > > >>>> >>>>> >> >> > >> > > >>>> >>>>> >> >> > >> >>>> https://docs.google.com/document/d/1aAwVjdZByA-0CHbgv16Me-vjaaDMCfhX7TzVVTuifYM/edit# >>>> >>>>> >> >> > >> > > > > > > >>>> >>>>> >> >> > >> > > > > > > >>>> >>>>> >> >> > >> > > > > > > Best Regards >>>> >>>>> >> >> > >> > > > > > > Peter Huang >>>> >>>>> >> >> > >> > > > > > > >>>> >>>>> >> >> > >> > > > > > > >>>> >>>>> >> >> > >> > > > > > > >>>> >>>>> >> >> > >> > > > > > > >>>> >>>>> >> >> > >> > > > > > > >>>> >>>>> >> >> > >> > > > > > > On Thu, Jan 2, 2020 at 11:29 AM Peter >>>> Huang < >>>> >>>>> >> >> > >> > > > > [hidden email]> >>>> >>>>> >> >> > >> > > > > > > wrote: >>>> >>>>> >> >> > >> > > > > > > >>>> >>>>> >> >> > >> > > > > > > > Hi Dian, >>>> >>>>> >> >> > >> > > > > > > > Thanks for giving us valuable feedbacks. >>>> >>>>> >> >> > >> > > > > > > > >>>> >>>>> >> >> > >> > > > > > > > 1) It's better to have a whole design >>>> for this feature >>>> >>>>> >> >> > >> > > > > > > > For the suggestion of enabling the >>>> cluster mode also session >>>> >>>>> >> >> > >> > > > > cluster, I >>>> >>>>> >> >> > >> > > > > > > > think Flink already supported it. >>>> WebSubmissionExtension >>>> >>>>> >> >> > >> already >>>> >>>>> >> >> > >> > > > > allows >>>> >>>>> >> >> > >> > > > > > > > users to start a job with the specified >>>> jar by using web UI. >>>> >>>>> >> >> > >> > > > > > > > But we need to enable the feature from >>>> CLI for both local >>>> >>>>> >> >> > >> jar, >>>> >>>>> >> >> > >> > > > remote >>>> >>>>> >> >> > >> > > > > > > jar. >>>> >>>>> >> >> > >> > > > > > > > I will align with Yang Wang first about >>>> the details and >>>> >>>>> >> >> > >> update >>>> >>>>> >> >> > >> > > the >>>> >>>>> >> >> > >> > > > > > design >>>> >>>>> >> >> > >> > > > > > > > doc. 
>>>> >>>>> >> >> > >> > > > > > > > >>>> >>>>> >> >> > >> > > > > > > > 2) It's better to consider the >>>> convenience for users, such >>>> >>>>> >> >> > >> as >>>> >>>>> >> >> > >> > > > > debugging >>>> >>>>> >> >> > >> > > > > > > > >>>> >>>>> >> >> > >> > > > > > > > I am wondering whether we can store the >>>> exception in >>>> >>>>> >> >> > >> jobgragh >>>> >>>>> >> >> > >> > > > > > > > generation in application master. As no >>>> streaming graph can >>>> >>>>> >> >> > >> be >>>> >>>>> >> >> > >> > > > > > scheduled >>>> >>>>> >> >> > >> > > > > > > in >>>> >>>>> >> >> > >> > > > > > > > this case, there will be no more TM will >>>> be requested from >>>> >>>>> >> >> > >> > > FlinkRM. >>>> >>>>> >> >> > >> > > > > > > > If the AM is still running, users can >>>> still query it from >>>> >>>>> >> >> > >> CLI. As >>>> >>>>> >> >> > >> > > > it >>>> >>>>> >> >> > >> > > > > > > > requires more change, we can get some >>>> feedback from < >>>> >>>>> >> >> > >> > > > > > [hidden email] >>>> >>>>> >> >> > >> > > > > > > > >>>> >>>>> >> >> > >> > > > > > > > and @[hidden email] <[hidden email] >>>> >. >>>> >>>>> >> >> > >> > > > > > > > >>>> >>>>> >> >> > >> > > > > > > > 3) It's better to consider the impact to >>>> the stability of >>>> >>>>> >> >> > >> the >>>> >>>>> >> >> > >> > > > cluster >>>> >>>>> >> >> > >> > > > > > > > >>>> >>>>> >> >> > >> > > > > > > > I agree with Yang Wang's opinion. >>>> >>>>> >> >> > >> > > > > > > > >>>> >>>>> >> >> > >> > > > > > > > >>>> >>>>> >> >> > >> > > > > > > > >>>> >>>>> >> >> > >> > > > > > > > Best Regards >>>> >>>>> >> >> > >> > > > > > > > Peter Huang >>>> >>>>> >> >> > >> > > > > > > > >>>> >>>>> >> >> > >> > > > > > > > >>>> >>>>> >> >> > >> > > > > > > > On Sun, Dec 29, 2019 at 9:44 PM Dian Fu < >>>> >>>>> >> >> > >> [hidden email]> >>>> >>>>> >> >> > >> > > > > wrote: >>>> >>>>> >> >> > >> > > > > > > > >>>> >>>>> >> >> > >> > > > > > > >> Hi all, >>>> >>>>> >> >> > >> > > > > > > >> >>>> >>>>> >> >> > >> > > > > > > >> Sorry to jump into this discussion. >>>> Thanks everyone for the >>>> >>>>> >> >> > >> > > > > > discussion. >>>> >>>>> >> >> > >> > > > > > > >> I'm very interested in this topic >>>> although I'm not an >>>> >>>>> >> >> > >> expert in >>>> >>>>> >> >> > >> > > > this >>>> >>>>> >> >> > >> > > > > > > part. >>>> >>>>> >> >> > >> > > > > > > >> So I'm glad to share my thoughts as >>>> following: >>>> >>>>> >> >> > >> > > > > > > >> >>>> >>>>> >> >> > >> > > > > > > >> 1) It's better to have a whole design >>>> for this feature >>>> >>>>> >> >> > >> > > > > > > >> As we know, there are two deployment >>>> modes: per-job mode >>>> >>>>> >> >> > >> and >>>> >>>>> >> >> > >> > > > session >>>> >>>>> >> >> > >> > > > > > > >> mode. I'm wondering which mode really >>>> needs this feature. >>>> >>>>> >> >> > >> As the >>>> >>>>> >> >> > >> > > > > > design >>>> >>>>> >> >> > >> > > > > > > doc >>>> >>>>> >> >> > >> > > > > > > >> mentioned, per-job mode is more used >>>> for streaming jobs and >>>> >>>>> >> >> > >> > > > session >>>> >>>>> >> >> > >> > > > > > > mode is >>>> >>>>> >> >> > >> > > > > > > >> usually used for batch jobs(Of course, >>>> the job types and >>>> >>>>> >> >> > >> the >>>> >>>>> >> >> > >> > > > > > deployment >>>> >>>>> >> >> > >> > > > > > > >> modes are orthogonal). 
Usually >>>> streaming job is only >>>> >>>>> >> >> > >> needed to >>>> >>>>> >> >> > >> > > be >>>> >>>>> >> >> > >> > > > > > > submitted >>>> >>>>> >> >> > >> > > > > > > >> once and it will run for days or weeks, >>>> while batch jobs >>>> >>>>> >> >> > >> will be >>>> >>>>> >> >> > >> > > > > > > submitted >>>> >>>>> >> >> > >> > > > > > > >> more frequently compared with streaming >>>> jobs. This means >>>> >>>>> >> >> > >> that >>>> >>>>> >> >> > >> > > > maybe >>>> >>>>> >> >> > >> > > > > > > session >>>> >>>>> >> >> > >> > > > > > > >> mode also needs this feature. However, >>>> if we support this >>>> >>>>> >> >> > >> > > feature >>>> >>>>> >> >> > >> > > > in >>>> >>>>> >> >> > >> > > > > > > >> session mode, the application master >>>> will become the new >>>> >>>>> >> >> > >> > > > centralized >>>> >>>>> >> >> > >> > > > > > > >> service(which should be solved). So in >>>> this case, it's >>>> >>>>> >> >> > >> better to >>>> >>>>> >> >> > >> > > > > have >>>> >>>>> >> >> > >> > > > > > a >>>> >>>>> >> >> > >> > > > > > > >> complete design for both per-job mode >>>> and session mode. >>>> >>>>> >> >> > >> > > > Furthermore, >>>> >>>>> >> >> > >> > > > > > > even >>>> >>>>> >> >> > >> > > > > > > >> if we can do it phase by phase, we need >>>> to have a whole >>>> >>>>> >> >> > >> picture >>>> >>>>> >> >> > >> > > of >>>> >>>>> >> >> > >> > > > > how >>>> >>>>> >> >> > >> > > > > > > it >>>> >>>>> >> >> > >> > > > > > > >> works in both per-job mode and session >>>> mode. >>>> >>>>> >> >> > >> > > > > > > >> >>>> >>>>> >> >> > >> > > > > > > >> 2) It's better to consider the >>>> convenience for users, such >>>> >>>>> >> >> > >> as >>>> >>>>> >> >> > >> > > > > > debugging >>>> >>>>> >> >> > >> > > > > > > >> After we finish this feature, the job >>>> graph will be >>>> >>>>> >> >> > >> compiled in >>>> >>>>> >> >> > >> > > > the >>>> >>>>> >> >> > >> > > > > > > >> application master, which means that >>>> users cannot easily >>>> >>>>> >> >> > >> get the >>>> >>>>> >> >> > >> > > > > > > exception >>>> >>>>> >> >> > >> > > > > > > >> message synchorousely in the job client >>>> if there are >>>> >>>>> >> >> > >> problems >>>> >>>>> >> >> > >> > > > during >>>> >>>>> >> >> > >> > > > > > the >>>> >>>>> >> >> > >> > > > > > > >> job graph compiling (especially for >>>> platform users), such >>>> >>>>> >> >> > >> as the >>>> >>>>> >> >> > >> > > > > > > resource >>>> >>>>> >> >> > >> > > > > > > >> path is incorrect, the user program >>>> itself has some >>>> >>>>> >> >> > >> problems, >>>> >>>>> >> >> > >> > > etc. >>>> >>>>> >> >> > >> > > > > > What >>>> >>>>> >> >> > >> > > > > > > I'm >>>> >>>>> >> >> > >> > > > > > > >> thinking is that maybe we should throw >>>> the exceptions as >>>> >>>>> >> >> > >> early >>>> >>>>> >> >> > >> > > as >>>> >>>>> >> >> > >> > > > > > > possible >>>> >>>>> >> >> > >> > > > > > > >> (during job submission stage). >>>> >>>>> >> >> > >> > > > > > > >> >>>> >>>>> >> >> > >> > > > > > > >> 3) It's better to consider the impact >>>> to the stability of >>>> >>>>> >> >> > >> the >>>> >>>>> >> >> > >> > > > > cluster >>>> >>>>> >> >> > >> > > > > > > >> If we perform the compiling in the >>>> application master, we >>>> >>>>> >> >> > >> should >>>> >>>>> >> >> > >> > > > > > > consider >>>> >>>>> >> >> > >> > > > > > > >> the impact of the compiling errors. 
>>>> Although YARN could >>>> >>>>> >> >> > >> resume >>>> >>>>> >> >> > >> > > the >>>> >>>>> >> >> > >> > > > > > > >> application master in case of failures, >>>> but in some case >>>> >>>>> >> >> > >> the >>>> >>>>> >> >> > >> > > > > compiling >>>> >>>>> >> >> > >> > > > > > > >> failure may be a waste of cluster >>>> resource and may impact >>>> >>>>> >> >> > >> the >>>> >>>>> >> >> > >> > > > > > stability >>>> >>>>> >> >> > >> > > > > > > the >>>> >>>>> >> >> > >> > > > > > > >> cluster and the other jobs in the >>>> cluster, such as the >>>> >>>>> >> >> > >> resource >>>> >>>>> >> >> > >> > > > path >>>> >>>>> >> >> > >> > > > > > is >>>> >>>>> >> >> > >> > > > > > > >> incorrect, the user program itself has >>>> some problems(in >>>> >>>>> >> >> > >> this >>>> >>>>> >> >> > >> > > case, >>>> >>>>> >> >> > >> > > > > job >>>> >>>>> >> >> > >> > > > > > > >> failover cannot solve this kind of >>>> problems) etc. In the >>>> >>>>> >> >> > >> current >>>> >>>>> >> >> > >> > > > > > > >> implemention, the compiling errors are >>>> handled in the >>>> >>>>> >> >> > >> client >>>> >>>>> >> >> > >> > > side >>>> >>>>> >> >> > >> > > > > and >>>> >>>>> >> >> > >> > > > > > > there >>>> >>>>> >> >> > >> > > > > > > >> is no impact to the cluster at all. >>>> >>>>> >> >> > >> > > > > > > >> >>>> >>>>> >> >> > >> > > > > > > >> Regarding to 1), it's clearly pointed >>>> in the design doc >>>> >>>>> >> >> > >> that >>>> >>>>> >> >> > >> > > only >>>> >>>>> >> >> > >> > > > > > > per-job >>>> >>>>> >> >> > >> > > > > > > >> mode will be supported. However, I >>>> think it's better to >>>> >>>>> >> >> > >> also >>>> >>>>> >> >> > >> > > > > consider >>>> >>>>> >> >> > >> > > > > > > the >>>> >>>>> >> >> > >> > > > > > > >> session mode in the design doc. >>>> >>>>> >> >> > >> > > > > > > >> Regarding to 2) and 3), I have not seen >>>> related sections >>>> >>>>> >> >> > >> in the >>>> >>>>> >> >> > >> > > > > design >>>> >>>>> >> >> > >> > > > > > > >> doc. It will be good if we can cover >>>> them in the design >>>> >>>>> >> >> > >> doc. >>>> >>>>> >> >> > >> > > > > > > >> >>>> >>>>> >> >> > >> > > > > > > >> Feel free to correct me If there is >>>> anything I >>>> >>>>> >> >> > >> misunderstand. >>>> >>>>> >> >> > >> > > > > > > >> >>>> >>>>> >> >> > >> > > > > > > >> Regards, >>>> >>>>> >> >> > >> > > > > > > >> Dian >>>> >>>>> >> >> > >> > > > > > > >> >>>> >>>>> >> >> > >> > > > > > > >> >>>> >>>>> >> >> > >> > > > > > > >> > 在 2019年12月27日,上午3:13,Peter Huang < >>>> >>>>> >> >> > >> [hidden email]> >>>> >>>>> >> >> > >> > > > 写道: >>>> >>>>> >> >> > >> > > > > > > >> > >>>> >>>>> >> >> > >> > > > > > > >> > Hi Yang, >>>> >>>>> >> >> > >> > > > > > > >> > >>>> >>>>> >> >> > >> > > > > > > >> > I can't agree more. The effort >>>> definitely needs to align >>>> >>>>> >> >> > >> with >>>> >>>>> >> >> > >> > > > the >>>> >>>>> >> >> > >> > > > > > > final >>>> >>>>> >> >> > >> > > > > > > >> > goal of FLIP-73. >>>> >>>>> >> >> > >> > > > > > > >> > I am thinking about whether we can >>>> achieve the goal with >>>> >>>>> >> >> > >> two >>>> >>>>> >> >> > >> > > > > phases. >>>> >>>>> >> >> > >> > > > > > > >> > >>>> >>>>> >> >> > >> > > > > > > >> > 1) Phase I >>>> >>>>> >> >> > >> > > > > > > >> > As the CLiFrontend will not be >>>> depreciated soon. 
We can >>>> >>>>> >> >> > >> still >>>> >>>>> >> >> > >> > > > use >>>> >>>>> >> >> > >> > > > > > the >>>> >>>>> >> >> > >> > > > > > > >> > deployMode flag there, >>>> >>>>> >> >> > >> > > > > > > >> > pass the program info through Flink >>>> configuration, use >>>> >>>>> >> >> > >> the >>>> >>>>> >> >> > >> > > > > > > >> > ClassPathJobGraphRetriever >>>> >>>>> >> >> > >> > > > > > > >> > to generate the job graph in >>>> ClusterEntrypoints of yarn >>>> >>>>> >> >> > >> and >>>> >>>>> >> >> > >> > > > > > > Kubernetes. >>>> >>>>> >> >> > >> > > > > > > >> > >>>> >>>>> >> >> > >> > > > > > > >> > 2) Phase II >>>> >>>>> >> >> > >> > > > > > > >> > In AbstractJobClusterExecutor, the >>>> job graph is >>>> >>>>> >> >> > >> generated in >>>> >>>>> >> >> > >> > > > the >>>> >>>>> >> >> > >> > > > > > > >> execute >>>> >>>>> >> >> > >> > > > > > > >> > function. We can still >>>> >>>>> >> >> > >> > > > > > > >> > use the deployMode in it. With >>>> deployMode = cluster, the >>>> >>>>> >> >> > >> > > execute >>>> >>>>> >> >> > >> > > > > > > >> function >>>> >>>>> >> >> > >> > > > > > > >> > only starts the cluster. >>>> >>>>> >> >> > >> > > > > > > >> > >>>> >>>>> >> >> > >> > > > > > > >> > When >>>> {Yarn/Kuberneates}PerJobClusterEntrypoint starts, >>>> >>>>> >> >> > >> It will >>>> >>>>> >> >> > >> > > > > start >>>> >>>>> >> >> > >> > > > > > > the >>>> >>>>> >> >> > >> > > > > > > >> > dispatch first, then we can use >>>> >>>>> >> >> > >> > > > > > > >> > a ClusterEnvironment similar to >>>> ContextEnvironment to >>>> >>>>> >> >> > >> submit >>>> >>>>> >> >> > >> > > the >>>> >>>>> >> >> > >> > > > > job >>>> >>>>> >> >> > >> > > > > > > >> with >>>> >>>>> >> >> > >> > > > > > > >> > jobName the local >>>> >>>>> >> >> > >> > > > > > > >> > dispatcher. For the details, we need >>>> more investigation. >>>> >>>>> >> >> > >> Let's >>>> >>>>> >> >> > >> > > > > wait >>>> >>>>> >> >> > >> > > > > > > >> > for @Aljoscha >>>> >>>>> >> >> > >> > > > > > > >> > Krettek <[hidden email]> @Till >>>> Rohrmann < >>>> >>>>> >> >> > >> > > > > [hidden email] >>>> >>>>> >> >> > >> > > > > > >'s >>>> >>>>> >> >> > >> > > > > > > >> > feedback after the holiday season. >>>> >>>>> >> >> > >> > > > > > > >> > >>>> >>>>> >> >> > >> > > > > > > >> > Thank you in advance. Merry Chrismas >>>> and Happy New >>>> >>>>> >> >> > >> Year!!! >>>> >>>>> >> >> > >> > > > > > > >> > >>>> >>>>> >> >> > >> > > > > > > >> > >>>> >>>>> >> >> > >> > > > > > > >> > Best Regards >>>> >>>>> >> >> > >> > > > > > > >> > Peter Huang >>>> >>>>> >> >> > >> > > > > > > >> > >>>> >>>>> >> >> > >> > > > > > > >> > >>>> >>>>> >> >> > >> > > > > > > >> > >>>> >>>>> >> >> > >> > > > > > > >> > >>>> >>>>> >> >> > >> > > > > > > >> > >>>> >>>>> >> >> > >> > > > > > > >> > >>>> >>>>> >> >> > >> > > > > > > >> > >>>> >>>>> >> >> > >> > > > > > > >> > >>>> >>>>> >> >> > >> > > > > > > >> > On Wed, Dec 25, 2019 at 1:08 AM Yang >>>> Wang < >>>> >>>>> >> >> > >> > > > [hidden email]> >>>> >>>>> >> >> > >> > > > > > > >> wrote: >>>> >>>>> >> >> > >> > > > > > > >> > >>>> >>>>> >> >> > >> > > > > > > >> >> Hi Peter, >>>> >>>>> >> >> > >> > > > > > > >> >> >>>> >>>>> >> >> > >> > > > > > > >> >> I think we need to reconsider >>>> tison's suggestion >>>> >>>>> >> >> > >> seriously. >>>> >>>>> >> >> > >> > > > After >>>> >>>>> >> >> > >> > > > > > > >> FLIP-73, >>>> >>>>> >> >> > >> > > > > > > >> >> the deployJobCluster has >>>> >>>>> >> >> > >> > > > > > > >> >> beenmoved into >>>> `JobClusterExecutor#execute`. 
It should >>>> >>>>> >> >> > >> not be >>>> >>>>> >> >> > >> > > > > > > perceived >>>> >>>>> >> >> > >> > > > > > > >> >> for `CliFrontend`. That >>>> >>>>> >> >> > >> > > > > > > >> >> means the user program will *ALWAYS* >>>> be executed on >>>> >>>>> >> >> > >> client >>>> >>>>> >> >> > >> > > > side. >>>> >>>>> >> >> > >> > > > > > This >>>> >>>>> >> >> > >> > > > > > > >> is >>>> >>>>> >> >> > >> > > > > > > >> >> the by design behavior. >>>> >>>>> >> >> > >> > > > > > > >> >> So, we could not just add `if(client >>>> mode) .. else >>>> >>>>> >> >> > >> if(cluster >>>> >>>>> >> >> > >> > > > > mode) >>>> >>>>> >> >> > >> > > > > > > >> ...` >>>> >>>>> >> >> > >> > > > > > > >> >> codes in `CliFrontend` to bypass >>>> >>>>> >> >> > >> > > > > > > >> >> the executor. We need to find a >>>> clean way to decouple >>>> >>>>> >> >> > >> > > executing >>>> >>>>> >> >> > >> > > > > > user >>>> >>>>> >> >> > >> > > > > > > >> >> program and deploying per-job >>>> >>>>> >> >> > >> > > > > > > >> >> cluster. Based on this, we could >>>> support to execute user >>>> >>>>> >> >> > >> > > > program >>>> >>>>> >> >> > >> > > > > on >>>> >>>>> >> >> > >> > > > > > > >> client >>>> >>>>> >> >> > >> > > > > > > >> >> or master side. >>>> >>>>> >> >> > >> > > > > > > >> >> >>>> >>>>> >> >> > >> > > > > > > >> >> Maybe Aljoscha and Jeff could give >>>> some good >>>> >>>>> >> >> > >> suggestions. >>>> >>>>> >> >> > >> > > > > > > >> >> >>>> >>>>> >> >> > >> > > > > > > >> >> >>>> >>>>> >> >> > >> > > > > > > >> >> >>>> >>>>> >> >> > >> > > > > > > >> >> Best, >>>> >>>>> >> >> > >> > > > > > > >> >> Yang >>>> >>>>> >> >> > >> > > > > > > >> >> >>>> >>>>> >> >> > >> > > > > > > >> >> Peter Huang < >>>> [hidden email]> 于2019年12月25日周三 >>>> >>>>> >> >> > >> > > > > 上午4:03写道: >>>> >>>>> >> >> > >> > > > > > > >> >> >>>> >>>>> >> >> > >> > > > > > > >> >>> Hi Jingjing, >>>> >>>>> >> >> > >> > > > > > > >> >>> >>>> >>>>> >> >> > >> > > > > > > >> >>> The improvement proposed is a >>>> deployment option for >>>> >>>>> >> >> > >> CLI. For >>>> >>>>> >> >> > >> > > > SQL >>>> >>>>> >> >> > >> > > > > > > based >>>> >>>>> >> >> > >> > > > > > > >> >>> Flink application, It is more >>>> convenient to use the >>>> >>>>> >> >> > >> existing >>>> >>>>> >> >> > >> > > > > model >>>> >>>>> >> >> > >> > > > > > > in >>>> >>>>> >> >> > >> > > > > > > >> >>> SqlClient in which >>>> >>>>> >> >> > >> > > > > > > >> >>> the job graph is generated within >>>> SqlClient. After >>>> >>>>> >> >> > >> adding >>>> >>>>> >> >> > >> > > the >>>> >>>>> >> >> > >> > > > > > > delayed >>>> >>>>> >> >> > >> > > > > > > >> job >>>> >>>>> >> >> > >> > > > > > > >> >>> graph generation, I think there is >>>> no change is needed >>>> >>>>> >> >> > >> for >>>> >>>>> >> >> > >> > > > your >>>> >>>>> >> >> > >> > > > > > > side. 
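As a rough illustration of the decoupling Yang asks for above (with FLIP-73, `JobClusterExecutor#execute` always compiles the JobGraph on the client), here is a hedged sketch of an execute() that either only deploys the cluster with program metadata or keeps the current client-side compilation. The execution.deploy-mode option and all method bodies are hypothetical, and the signature only loosely mirrors the FLIP-73 executor.

// Hedged sketch: separating "deploy the cluster" from "compile and submit the JobGraph".
import java.util.concurrent.CompletableFuture;

import org.apache.flink.api.dag.Pipeline;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.core.execution.JobClient;
import org.apache.flink.runtime.jobgraph.JobGraph;

class JobClusterExecutorSketch {

    CompletableFuture<JobClient> execute(Pipeline pipeline, Configuration configuration) throws Exception {
        // Hypothetical option; today the executor always compiles on the client.
        boolean clusterDeploy = "cluster".equals(configuration.getString("execution.deploy-mode", "client"));

        if (clusterDeploy) {
            // Only start the cluster and ship program metadata; the entrypoint
            // compiles the JobGraph (e.g. via a ClassPathJobGraphRetriever-like component).
            return deployClusterWithProgramMetadata(configuration);
        }
        // Unchanged per-job behaviour: compile here, then deploy the cluster with the graph.
        JobGraph jobGraph = compileOnClient(pipeline, configuration);
        return deployJobCluster(jobGraph, configuration);
    }

    private CompletableFuture<JobClient> deployClusterWithProgramMetadata(Configuration conf) {
        throw new UnsupportedOperationException("illustration only");
    }

    private JobGraph compileOnClient(Pipeline pipeline, Configuration conf) {
        throw new UnsupportedOperationException("illustration only");
    }

    private CompletableFuture<JobClient> deployJobCluster(JobGraph jobGraph, Configuration conf) {
        throw new UnsupportedOperationException("illustration only");
    }
}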
>>>> >>>>> >> >> > >> > > > > > > >> >>> >>>> >>>>> >> >> > >> > > > > > > >> >>> >>>> >>>>> >> >> > >> > > > > > > >> >>> Best Regards >>>> >>>>> >> >> > >> > > > > > > >> >>> Peter Huang >>>> >>>>> >> >> > >> > > > > > > >> >>> >>>> >>>>> >> >> > >> > > > > > > >> >>> >>>> >>>>> >> >> > >> > > > > > > >> >>> On Wed, Dec 18, 2019 at 6:01 AM >>>> jingjing bai < >>>> >>>>> >> >> > >> > > > > > > >> [hidden email]> >>>> >>>>> >> >> > >> > > > > > > >> >>> wrote: >>>> >>>>> >> >> > >> > > > > > > >> >>> >>>> >>>>> >> >> > >> > > > > > > >> >>>> hi peter: >>>> >>>>> >> >> > >> > > > > > > >> >>>> we had extension SqlClent to >>>> support sql job >>>> >>>>> >> >> > >> submit in >>>> >>>>> >> >> > >> > > web >>>> >>>>> >> >> > >> > > > > > base >>>> >>>>> >> >> > >> > > > > > > on >>>> >>>>> >> >> > >> > > > > > > >> >>>> flink 1.9. we support submit to >>>> yarn on per job >>>> >>>>> >> >> > >> mode too. >>>> >>>>> >> >> > >> > > > > > > >> >>>> in this case, the job graph >>>> generated on client >>>> >>>>> >> >> > >> side >>>> >>>>> >> >> > >> > > . I >>>> >>>>> >> >> > >> > > > > > think >>>> >>>>> >> >> > >> > > > > > > >> >>> this >>>> >>>>> >> >> > >> > > > > > > >> >>>> discuss Mainly to improve api >>>> programme. but in my >>>> >>>>> >> >> > >> case , >>>> >>>>> >> >> > >> > > > > there >>>> >>>>> >> >> > >> > > > > > is >>>> >>>>> >> >> > >> > > > > > > >> no >>>> >>>>> >> >> > >> > > > > > > >> >>>> jar to upload but only a sql >>>> string . >>>> >>>>> >> >> > >> > > > > > > >> >>>> do u had more suggestion to >>>> improve for sql mode >>>> >>>>> >> >> > >> or it >>>> >>>>> >> >> > >> > > is >>>> >>>>> >> >> > >> > > > > > only a >>>> >>>>> >> >> > >> > > > > > > >> >>>> switch for api programme? >>>> >>>>> >> >> > >> > > > > > > >> >>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>> best >>>> >>>>> >> >> > >> > > > > > > >> >>>> bai jj >>>> >>>>> >> >> > >> > > > > > > >> >>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>> Yang Wang <[hidden email]> >>>> 于2019年12月18日周三 >>>> >>>>> >> >> > >> 下午7:21写道: >>>> >>>>> >> >> > >> > > > > > > >> >>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>> I just want to revive this >>>> discussion. >>>> >>>>> >> >> > >> > > > > > > >> >>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>> Recently, i am thinking about how >>>> to natively run >>>> >>>>> >> >> > >> flink >>>> >>>>> >> >> > >> > > > > per-job >>>> >>>>> >> >> > >> > > > > > > >> >>> cluster on >>>> >>>>> >> >> > >> > > > > > > >> >>>>> Kubernetes. >>>> >>>>> >> >> > >> > > > > > > >> >>>>> The per-job mode on Kubernetes is >>>> very different >>>> >>>>> >> >> > >> from on >>>> >>>>> >> >> > >> > > > Yarn. >>>> >>>>> >> >> > >> > > > > > And >>>> >>>>> >> >> > >> > > > > > > >> we >>>> >>>>> >> >> > >> > > > > > > >> >>> will >>>> >>>>> >> >> > >> > > > > > > >> >>>>> have >>>> >>>>> >> >> > >> > > > > > > >> >>>>> the same deployment requirements >>>> to the client and >>>> >>>>> >> >> > >> entry >>>> >>>>> >> >> > >> > > > > point. >>>> >>>>> >> >> > >> > > > > > > >> >>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>> 1. Flink client not always need a >>>> local jar to start >>>> >>>>> >> >> > >> a >>>> >>>>> >> >> > >> > > Flink >>>> >>>>> >> >> > >> > > > > > > per-job >>>> >>>>> >> >> > >> > > > > > > >> >>>>> cluster. We could >>>> >>>>> >> >> > >> > > > > > > >> >>>>> support multiple schemas. 
For >>>> example, >>>> >>>>> >> >> > >> > > > file:///path/of/my.jar >>>> >>>>> >> >> > >> > > > > > > means >>>> >>>>> >> >> > >> > > > > > > >> a >>>> >>>>> >> >> > >> > > > > > > >> >>> jar >>>> >>>>> >> >> > >> > > > > > > >> >>>>> located >>>> >>>>> >> >> > >> > > > > > > >> >>>>> at client side, >>>> >>>>> >> >> > >> hdfs://myhdfs/user/myname/flink/my.jar >>>> >>>>> >> >> > >> > > > means a >>>> >>>>> >> >> > >> > > > > > jar >>>> >>>>> >> >> > >> > > > > > > >> >>> located >>>> >>>>> >> >> > >> > > > > > > >> >>>>> at >>>> >>>>> >> >> > >> > > > > > > >> >>>>> remote hdfs, >>>> local:///path/in/image/my.jar means a >>>> >>>>> >> >> > >> jar >>>> >>>>> >> >> > >> > > > located >>>> >>>>> >> >> > >> > > > > > at >>>> >>>>> >> >> > >> > > > > > > >> >>>>> jobmanager side. >>>> >>>>> >> >> > >> > > > > > > >> >>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>> 2. Support running user program >>>> on master side. This >>>> >>>>> >> >> > >> also >>>> >>>>> >> >> > >> > > > > means >>>> >>>>> >> >> > >> > > > > > > the >>>> >>>>> >> >> > >> > > > > > > >> >>> entry >>>> >>>>> >> >> > >> > > > > > > >> >>>>> point >>>> >>>>> >> >> > >> > > > > > > >> >>>>> will generate the job graph on >>>> master side. We could >>>> >>>>> >> >> > >> use >>>> >>>>> >> >> > >> > > the >>>> >>>>> >> >> > >> > > > > > > >> >>>>> ClasspathJobGraphRetriever >>>> >>>>> >> >> > >> > > > > > > >> >>>>> or start a local Flink client to >>>> achieve this >>>> >>>>> >> >> > >> purpose. >>>> >>>>> >> >> > >> > > > > > > >> >>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>> cc tison, Aljoscha & Kostas Do >>>> you think this is the >>>> >>>>> >> >> > >> right >>>> >>>>> >> >> > >> > > > > > > >> direction we >>>> >>>>> >> >> > >> > > > > > > >> >>>>> need to work? >>>> >>>>> >> >> > >> > > > > > > >> >>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>> tison <[hidden email]> >>>> 于2019年12月12日周四 >>>> >>>>> >> >> > >> 下午4:48写道: >>>> >>>>> >> >> > >> > > > > > > >> >>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>> A quick idea is that we separate >>>> the deployment >>>> >>>>> >> >> > >> from user >>>> >>>>> >> >> > >> > > > > > program >>>> >>>>> >> >> > >> > > > > > > >> >>> that >>>> >>>>> >> >> > >> > > > > > > >> >>>>> it >>>> >>>>> >> >> > >> > > > > > > >> >>>>>> has always been done >>>> >>>>> >> >> > >> > > > > > > >> >>>>>> outside the program. On user >>>> program executed there >>>> >>>>> >> >> > >> is >>>> >>>>> >> >> > >> > > > > always a >>>> >>>>> >> >> > >> > > > > > > >> >>>>>> ClusterClient that communicates >>>> with >>>> >>>>> >> >> > >> > > > > > > >> >>>>>> an existing cluster, remote or >>>> local. It will be >>>> >>>>> >> >> > >> another >>>> >>>>> >> >> > >> > > > > thread >>>> >>>>> >> >> > >> > > > > > > so >>>> >>>>> >> >> > >> > > > > > > >> >>> just >>>> >>>>> >> >> > >> > > > > > > >> >>>>> for >>>> >>>>> >> >> > >> > > > > > > >> >>>>>> your information. >>>> >>>>> >> >> > >> > > > > > > >> >>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>> Best, >>>> >>>>> >> >> > >> > > > > > > >> >>>>>> tison. 
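A small sketch of the scheme-based handling Yang describes above; only the file://, hdfs:// and local:// semantics come from the discussion, while the enum and method are made up for illustration.

// Hedged sketch of scheme-based user-jar handling; names are hypothetical.
import java.net.URI;

class UserJarLocationSketch {

    enum Location { CLIENT_LOCAL, REMOTE_DFS, IMAGE_LOCAL }

    /** file:// -> on the client, hdfs:// or other remote schemes -> remote storage, local:// -> already in the JM image. */
    static Location classify(URI jarUri) {
        String scheme = jarUri.getScheme() == null ? "file" : jarUri.getScheme();
        switch (scheme) {
            case "file":
                return Location.CLIENT_LOCAL;   // client must ship it (e.g. as a YARN LocalResource)
            case "local":
                return Location.IMAGE_LOCAL;    // nothing to ship or fetch, entrypoint reads it directly
            default:
                return Location.REMOTE_DFS;     // entrypoint fetches it via Flink's FileSystem
        }
    }
}

For example, classify(URI.create("hdfs://myhdfs/user/myname/flink/my.jar")) would return REMOTE_DFS, so only a path string needs to travel from the client to the cluster.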
>>>> >>>>> >> >> > >> > > > > > > >> >>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>> tison <[hidden email]> >>>> 于2019年12月12日周四 >>>> >>>>> >> >> > >> 下午4:40写道: >>>> >>>>> >> >> > >> > > > > > > >> >>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>> Hi Peter, >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>> Another concern I realized >>>> recently is that with >>>> >>>>> >> >> > >> current >>>> >>>>> >> >> > >> > > > > > > Executors >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>> abstraction(FLIP-73) >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>> I'm afraid that user program is >>>> designed to ALWAYS >>>> >>>>> >> >> > >> run >>>> >>>>> >> >> > >> > > on >>>> >>>>> >> >> > >> > > > > the >>>> >>>>> >> >> > >> > > > > > > >> >>> client >>>> >>>>> >> >> > >> > > > > > > >> >>>>>> side. >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>> Specifically, >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>> we deploy the job in executor >>>> when env.execute >>>> >>>>> >> >> > >> called. >>>> >>>>> >> >> > >> > > > This >>>> >>>>> >> >> > >> > > > > > > >> >>>>> abstraction >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>> possibly prevents >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>> Flink runs user program on the >>>> cluster side. >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>> For your proposal, in this case >>>> we already >>>> >>>>> >> >> > >> compiled the >>>> >>>>> >> >> > >> > > > > > program >>>> >>>>> >> >> > >> > > > > > > >> and >>>> >>>>> >> >> > >> > > > > > > >> >>>>> run >>>> >>>>> >> >> > >> > > > > > > >> >>>>>> on >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>> the client side, >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>> even we deploy a cluster and >>>> retrieve job graph >>>> >>>>> >> >> > >> from >>>> >>>>> >> >> > >> > > > program >>>> >>>>> >> >> > >> > > > > > > >> >>>>> metadata, it >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>> doesn't make >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>> many sense. >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>> cc Aljoscha & Kostas what do >>>> you think about this >>>> >>>>> >> >> > >> > > > > constraint? >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>> Best, >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>> tison. >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>> Peter Huang < >>>> [hidden email]> >>>> >>>>> >> >> > >> 于2019年12月10日周二 >>>> >>>>> >> >> > >> > > > > > > >> 下午12:45写道: >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>> Hi Tison, >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>> Yes, you are right. I think I >>>> made the wrong >>>> >>>>> >> >> > >> argument >>>> >>>>> >> >> > >> > > in >>>> >>>>> >> >> > >> > > > > the >>>> >>>>> >> >> > >> > > > > > > doc. >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>> Basically, the packaging jar >>>> problem is only for >>>> >>>>> >> >> > >> > > platform >>>> >>>>> >> >> > >> > > > > > > users. 
>>>> >>>>> >> >> > >> > > > > > > >> >>> In >>>> >>>>> >> >> > >> > > > > > > >> >>>>> our >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>> internal deploy service, >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>> we further optimized the >>>> deployment latency by >>>> >>>>> >> >> > >> letting >>>> >>>>> >> >> > >> > > > > users >>>> >>>>> >> >> > >> > > > > > to >>>> >>>>> >> >> > >> > > > > > > >> >>>>>> packaging >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>> flink-runtime together with >>>> the uber jar, so that >>>> >>>>> >> >> > >> we >>>> >>>>> >> >> > >> > > > don't >>>> >>>>> >> >> > >> > > > > > need >>>> >>>>> >> >> > >> > > > > > > >> to >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>> consider >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>> multiple flink version >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>> support for now. In the >>>> session client mode, as >>>> >>>>> >> >> > >> Flink >>>> >>>>> >> >> > >> > > > libs >>>> >>>>> >> >> > >> > > > > > will >>>> >>>>> >> >> > >> > > > > > > >> be >>>> >>>>> >> >> > >> > > > > > > >> >>>>>> shipped >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>> anyway as local resources of >>>> yarn. Users actually >>>> >>>>> >> >> > >> don't >>>> >>>>> >> >> > >> > > > > need >>>> >>>>> >> >> > >> > > > > > to >>>> >>>>> >> >> > >> > > > > > > >> >>>>> package >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>> those libs into job jar. >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>> Best Regards >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>> Peter Huang >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>> On Mon, Dec 9, 2019 at 8:35 PM >>>> tison < >>>> >>>>> >> >> > >> > > > [hidden email] >>>> >>>>> >> >> > >> > > > > > >>>> >>>>> >> >> > >> > > > > > > >> >>> wrote: >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> 3. What do you mean about >>>> the package? Do users >>>> >>>>> >> >> > >> need >>>> >>>>> >> >> > >> > > to >>>> >>>>> >> >> > >> > > > > > > >> >>> compile >>>> >>>>> >> >> > >> > > > > > > >> >>>>>> their >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>> jars >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>> inlcuding flink-clients, >>>> flink-optimizer, >>>> >>>>> >> >> > >> flink-table >>>> >>>>> >> >> > >> > > > > codes? >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>> The answer should be no >>>> because they exist in >>>> >>>>> >> >> > >> system >>>> >>>>> >> >> > >> > > > > > > classpath. >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>> Best, >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>> tison. >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>> Yang Wang < >>>> [hidden email]> 于2019年12月10日周二 >>>> >>>>> >> >> > >> > > > > 下午12:18写道: >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> Hi Peter, >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> Thanks a lot for starting >>>> this discussion. I >>>> >>>>> >> >> > >> think >>>> >>>>> >> >> > >> > > this >>>> >>>>> >> >> > >> > > > > is >>>> >>>>> >> >> > >> > > > > > a >>>> >>>>> >> >> > >> > > > > > > >> >>> very >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>> useful >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> feature. 
>>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> Not only for Yarn, i am >>>> focused on flink on >>>> >>>>> >> >> > >> > > Kubernetes >>>> >>>>> >> >> > >> > > > > > > >> >>>>> integration >>>> >>>>> >> >> > >> > > > > > > >> >>>>>> and >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>> come >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> across the same >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> problem. I do not want the >>>> job graph generated >>>> >>>>> >> >> > >> on >>>> >>>>> >> >> > >> > > > client >>>> >>>>> >> >> > >> > > > > > > side. >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>> Instead, >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>> the >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> user jars are built in >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> a user-defined image. When >>>> the job manager >>>> >>>>> >> >> > >> launched, >>>> >>>>> >> >> > >> > > we >>>> >>>>> >> >> > >> > > > > > just >>>> >>>>> >> >> > >> > > > > > > >> >>>>> need to >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> generate the job graph >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> based on local user jars. >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> I have some small suggestion >>>> about this. >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> 1. >>>> `ProgramJobGraphRetriever` is very similar to >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> >>>> `ClasspathJobGraphRetriever`, the differences >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> are the former needs >>>> `ProgramMetadata` and the >>>> >>>>> >> >> > >> latter >>>> >>>>> >> >> > >> > > > > needs >>>> >>>>> >> >> > >> > > > > > > >> >>> some >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>> arguments. >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> Is it possible to >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> have an unified >>>> `JobGraphRetriever` to support >>>> >>>>> >> >> > >> both? >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> 2. Is it possible to not use >>>> a local user jar to >>>> >>>>> >> >> > >> > > start >>>> >>>>> >> >> > >> > > > a >>>> >>>>> >> >> > >> > > > > > > >> >>> per-job >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>> cluster? >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> In your case, the user jars >>>> has >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> existed on hdfs already and >>>> we do need to >>>> >>>>> >> >> > >> download >>>> >>>>> >> >> > >> > > the >>>> >>>>> >> >> > >> > > > > jars >>>> >>>>> >> >> > >> > > > > > > to >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>> deployer >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> service. Currently, we >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> always need a local user jar >>>> to start a flink >>>> >>>>> >> >> > >> > > cluster. >>>> >>>>> >> >> > >> > > > It >>>> >>>>> >> >> > >> > > > > > is >>>> >>>>> >> >> > >> > > > > > > >> >>> be >>>> >>>>> >> >> > >> > > > > > > >> >>>>>> great >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>> if >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>> we >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> could support remote user >>>> jars. >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>>> In the implementation, we >>>> assume users package >>>> >>>>> >> >> > >> > > > > > > >> >>> flink-clients, >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> flink-optimizer, flink-table >>>> together within >>>> >>>>> >> >> > >> the job >>>> >>>>> >> >> > >> > > > jar. 
>>>> >>>>> >> >> > >> > > > > > > >> >>>>> Otherwise, >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>> the >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> job graph generation within >>>> >>>>> >> >> > >> JobClusterEntryPoint will >>>> >>>>> >> >> > >> > > > > fail. >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> 3. What do you mean about >>>> the package? Do users >>>> >>>>> >> >> > >> need >>>> >>>>> >> >> > >> > > to >>>> >>>>> >> >> > >> > > > > > > >> >>> compile >>>> >>>>> >> >> > >> > > > > > > >> >>>>>> their >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>> jars >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> inlcuding flink-clients, >>>> flink-optimizer, >>>> >>>>> >> >> > >> flink-table >>>> >>>>> >> >> > >> > > > > > codes? >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> Best, >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> Yang >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> Peter Huang < >>>> [hidden email]> >>>> >>>>> >> >> > >> > > > 于2019年12月10日周二 >>>> >>>>> >> >> > >> > > > > > > >> >>>>> 上午2:37写道: >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> Dear All, >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> Recently, the Flink >>>> community starts to >>>> >>>>> >> >> > >> improve the >>>> >>>>> >> >> > >> > > > yarn >>>> >>>>> >> >> > >> > > > > > > >> >>>>> cluster >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> descriptor >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> to make job jar and config >>>> files configurable >>>> >>>>> >> >> > >> from >>>> >>>>> >> >> > >> > > > CLI. >>>> >>>>> >> >> > >> > > > > It >>>> >>>>> >> >> > >> > > > > > > >> >>>>>> improves >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>> the >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> flexibility of Flink >>>> deployment Yarn Per Job >>>> >>>>> >> >> > >> Mode. >>>> >>>>> >> >> > >> > > > For >>>> >>>>> >> >> > >> > > > > > > >> >>>>> platform >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>> users >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> who >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> manage tens of hundreds of >>>> streaming pipelines >>>> >>>>> >> >> > >> for >>>> >>>>> >> >> > >> > > the >>>> >>>>> >> >> > >> > > > > > whole >>>> >>>>> >> >> > >> > > > > > > >> >>>>> org >>>> >>>>> >> >> > >> > > > > > > >> >>>>>> or >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> company, we found the job >>>> graph generation in >>>> >>>>> >> >> > >> > > > > client-side >>>> >>>>> >> >> > >> > > > > > is >>>> >>>>> >> >> > >> > > > > > > >> >>>>>> another >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> pinpoint. Thus, we want to >>>> propose a >>>> >>>>> >> >> > >> configurable >>>> >>>>> >> >> > >> > > > > feature >>>> >>>>> >> >> > >> > > > > > > >> >>> for >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> FlinkYarnSessionCli. 
The >>>> feature can allow >>>> >>>>> >> >> > >> users to >>>> >>>>> >> >> > >> > > > > choose >>>> >>>>> >> >> > >> > > > > > > >> >>> the >>>> >>>>> >> >> > >> > > > > > > >> >>>>> job >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>> graph >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> generation in Flink >>>> ClusterEntryPoint so that >>>> >>>>> >> >> > >> the >>>> >>>>> >> >> > >> > > job >>>> >>>>> >> >> > >> > > > > jar >>>> >>>>> >> >> > >> > > > > > > >> >>>>> doesn't >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>> need >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>> to >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> be locally for the job >>>> graph generation. The >>>> >>>>> >> >> > >> > > proposal >>>> >>>>> >> >> > >> > > > is >>>> >>>>> >> >> > >> > > > > > > >> >>>>> organized >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>> as a >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> FLIP >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>> >>>> >>>>> >> >> > >> > > > > > > >> >>>> >>>>> >> >> > >> > > > > > > >>>> >>>>> >> >> > >> > > > > > >>>> >>>>> >> >> > >> > > > > >>>> >>>>> >> >> > >> > > > >>>> >>>>> >> >> > >> > > >>>> >>>>> >> >> > >> >>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-85+Delayed+JobGraph+Generation >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> . >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> Any questions and >>>> suggestions are welcomed. >>>> >>>>> >> >> > >> Thank >>>> >>>>> >> >> > >> > > you >>>> >>>>> >> >> > >> > > > in >>>> >>>>> >> >> > >> > > > > > > >> >>>>> advance. >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> Best Regards >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> Peter Huang >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>>> >>>> >>>>> >> >> > >> > > > > > > >> >>>> >>>> >>>>> >> >> > >> > > > > > > >> >>> >>>> >>>>> >> >> > >> > > > > > > >> >> >>>> >>>>> >> >> > >> > > > > > > >> >>>> >>>>> >> >> > >> > > > > > > >> >>>> >>>>> >> >> > >> > > > > > > >>>> >>>>> >> >> > >> > > > > > >>>> >>>>> >> >> > >> > > > > >>>> >>>>> >> >> > >> > > > >>>> >>>>> >> >> > >> > > >>>> >>>>> >> >> > >> >>>> >>>>> >> >> > > >>>> >>>>> >> >> >>>> >>> |
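The three jar location schemas Yang describes above (file://, hdfs://, local://) map naturally onto Flink's FileSystem abstraction. The sketch below is only an illustration of how an entrypoint-side helper could resolve them before generating the job graph; `UserJarResolver` and `resolveUserJar` are invented names, not classes from the FLIP or from Flink, and the copy step simply assumes the cross-FileSystem copy utility in flink-core.

    import java.io.IOException;

    import org.apache.flink.core.fs.Path;
    import org.apache.flink.util.FileUtils;

    /** Hypothetical helper, for illustration only -- not a class from the FLIP or from Flink. */
    public final class UserJarResolver {

        /**
         * Returns a path to the user jar that is readable on the JobManager host.
         * file:// and local:// jars are assumed to be readable as-is; any other
         * scheme (hdfs://, s3://, ...) is copied into workingDir through the
         * FileSystem registered for that scheme.
         */
        public static Path resolveUserJar(String userJar, Path workingDir) throws IOException {
            final Path source = new Path(userJar);
            final String scheme = source.toUri().getScheme();

            if (scheme == null || "file".equals(scheme) || "local".equals(scheme)) {
                // nothing to fetch, the jar is already visible on this host / in the image
                return source;
            }

            // remote jar: let Flink's FileSystem abstraction do the download
            final Path target = new Path(workingDir, source.getName());
            FileUtils.copy(source, target, false);
            return target;
        }

        private UserJarResolver() {}
    }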
Hi Becket,
Thanks for your suggestion. We will update the FLIP to add/enrich the following parts. * User cli option change, use "-R/--remote" to apply the cluster deploy mode * Configuration change, how to specify remote user jars and dependencies * The whole story about how "application mode" works, upload -> fetch -> submit job * The cluster lifecycle, when and how the Flink cluster is destroyed Best, Yang Becket Qin <[hidden email]> 于2020年3月9日周一 下午12:34写道: > Thanks for the reply, tison and Yang, > > Regarding the public interface, is "-R/--remote" option the only change? > Will the users also need to provide a remote location to upload and store > the jars, and a list of jars as dependencies to be uploaded? > > It would be important that the public interface section in the FLIP > includes all the user sensible changes including the CLI / configuration / > metrics, etc. Can we update the FLIP to include the conclusion we have here > in the ML? > > Thanks, > > Jiangjie (Becket) Qin > > On Mon, Mar 9, 2020 at 11:59 AM Yang Wang <[hidden email]> wrote: > >> Hi Becket, >> >> Thanks for jumping out and sharing your concerns. I second tison's answer >> and just >> make some additions. >> >> >> > job submission interface >> >> This FLIP will introduce an interface for running user `main()` on >> cluster, named as >> “ProgramDeployer”. However, it is not a public interface. It will be used >> in `CliFrontend` >> when the remote deploy option(-R/--remote-deploy) is specified. So the >> only changes >> on user side is about the cli option. >> >> >> > How to fetch the jars? >> >> The “local path” and “dfs path“ could be supported to fetch the user jars >> and dependencies. >> Just like tison has said, we could ship the user jar and dependencies >> from client side to >> HDFS and use the entrypoint to fetch. >> >> Also we have some other practical ways to use the new “application mode“. >> 1. Upload the user jars and dependencies to the DFS(e.g. HDFS, S3, Aliyun >> OSS) manually >> or some external deployer system. For K8s, the user jars and dependencies >> could also be >> built in the docker image. >> 2. Specify the remote/local user jar and dependencies in `flink run`. >> Usually this could also >> be done by the external deployer system. >> 3. When the `ClusterEntrypoint` is launched, it will fetch the jars and >> files automatically. We >> do not need any specific fetcher implementation. Since we could leverage >> flink `FileSystem` >> to do this. >> >> >> >> >> >> Best, >> Yang >> >> tison <[hidden email]> 于2020年3月9日周一 上午11:34写道: >> >>> Hi Becket, >>> >>> Thanks for your attention on FLIP-85! I answered your question inline. >>> >>> 1. What exactly the job submission interface will look like after this >>> FLIP? The FLIP template has a Public Interface section but was removed from >>> this FLIP. >>> >>> As Yang mentioned in this thread above: >>> >>> From user perspective, only a `-R/-- remote-deploy` cli option is >>> visible. They are not aware of the application mode. >>> >>> 2. How will the new ClusterEntrypoint fetch the jars from external >>> storage? What external storage will be supported out of the box? Will this >>> "jar fetcher" be pluggable? If so, how does the API look like and how will >>> users specify the custom "jar fetcher"? >>> >>> It depends actually. Here are several points: >>> >>> i. Currently, shipping user files is handled by Flink, dependencies >>> fetching can be handled by Flink. >>> ii. Current, we only support local file system shipfiles. 
When in >>> Application Mode, to support meaningful jar fetch we should support user to >>> configure richer shipfiles schema at first. >>> iii. Dependencies fetching varies from deployments. That is, on YARN, >>> its convention is through HDFS; on Kubernetes, its convention is configured >>> resource server and fetched by initContainer. >>> >>> Thus, in the First phase of Application Mode dependencies fetching is >>> totally handled within Flink. >>> >>> 3. It sounds that in this FLIP, the "session cluster" running the >>> application has the same lifecycle as the user application. How will the >>> session cluster be teared down after the application finishes? Will the >>> ClusterEntrypoint do that? Will there be an option of not tearing the >>> cluster down? >>> >>> The precondition we tear down the cluster is *both* >>> >>> i. user main reached to its end >>> ii. all jobs submitted(current, at most one) reached global terminate >>> state >>> >>> For the "how", it is an implementation topic, but conceptually it is >>> ClusterEntrypoint's responsibility. >>> >>> >Will there be an option of not tearing the cluster down? >>> >>> I think the answer is "No" because the cluster is designed to be bounded >>> with an Application. User logic that communicates with the job is always in >>> its `main`, and for history information we have history server. >>> >>> Best, >>> tison. >>> >>> >>> Becket Qin <[hidden email]> 于2020年3月9日周一 上午8:12写道: >>> >>>> Hi Peter and Kostas, >>>> >>>> Thanks for creating this FLIP. Moving the JobGraph compilation to the >>>> cluster makes a lot of sense to me. FLIP-40 had the exactly same idea, but >>>> is currently dormant and can probably be superseded by this FLIP. After >>>> reading the FLIP, I still have a few questions. >>>> >>>> 1. What exactly the job submission interface will look like after this >>>> FLIP? The FLIP template has a Public Interface section but was removed from >>>> this FLIP. >>>> 2. How will the new ClusterEntrypoint fetch the jars from external >>>> storage? What external storage will be supported out of the box? Will this >>>> "jar fetcher" be pluggable? If so, how does the API look like and how will >>>> users specify the custom "jar fetcher"? >>>> 3. It sounds that in this FLIP, the "session cluster" running the >>>> application has the same lifecycle as the user application. How will the >>>> session cluster be teared down after the application finishes? Will the >>>> ClusterEntrypoint do that? Will there be an option of not tearing the >>>> cluster down? >>>> >>>> Maybe they have been discussed in the ML earlier, but I think they >>>> should be part of the FLIP also. >>>> >>>> Thanks, >>>> >>>> Jiangjie (Becket) Qin >>>> >>>> On Thu, Mar 5, 2020 at 10:09 PM Kostas Kloudas <[hidden email]> >>>> wrote: >>>> >>>>> Also from my side +1 to start voting. >>>>> >>>>> Cheers, >>>>> Kostas >>>>> >>>>> On Thu, Mar 5, 2020 at 7:45 AM tison <[hidden email]> wrote: >>>>> > >>>>> > +1 to star voting. >>>>> > >>>>> > Best, >>>>> > tison. >>>>> > >>>>> > >>>>> > Yang Wang <[hidden email]> 于2020年3月5日周四 下午2:29写道: >>>>> >> >>>>> >> Hi Peter, >>>>> >> Really thanks for your response. >>>>> >> >>>>> >> Hi all @Kostas Kloudas @Zili Chen @Peter Huang @Rong Rong >>>>> >> It seems that we have reached an agreement. The “application mode” >>>>> is regarded as the enhanced “per-job”. It is >>>>> >> orthogonal with “cluster deploy”. Currently, we bind the “per-job” >>>>> to `run-user-main-on-client` and “application mode” >>>>> >> to `run-user-main-on-cluster`. 
>>>>> >> >>>>> >> Do you have other concerns to moving FLIP-85 to voting? >>>>> >> >>>>> >> >>>>> >> Best, >>>>> >> Yang >>>>> >> >>>>> >> Peter Huang <[hidden email]> 于2020年3月5日周四 下午12:48写道: >>>>> >>> >>>>> >>> Hi Yang and Kostas, >>>>> >>> >>>>> >>> Thanks for the clarification. It makes more sense to me if the >>>>> long term goal is to replace per job mode to application mode >>>>> >>> in the future (at the time that multiple execute can be >>>>> supported). Before that, It will be better to keep the concept of >>>>> >>> application mode internally. As Yang suggested, User only need to >>>>> use a `-R/-- remote-deploy` cli option to launch >>>>> >>> a per job cluster with the main function executed in cluster >>>>> entry-point. +1 for the execution plan. >>>>> >>> >>>>> >>> >>>>> >>> >>>>> >>> Best Regards >>>>> >>> Peter Huang >>>>> >>> >>>>> >>> >>>>> >>> >>>>> >>> >>>>> >>> On Tue, Mar 3, 2020 at 7:11 AM Yang Wang <[hidden email]> >>>>> wrote: >>>>> >>>> >>>>> >>>> Hi Peter, >>>>> >>>> >>>>> >>>> Having the application mode does not mean we will drop the >>>>> cluster-deploy >>>>> >>>> option. I just want to share some thoughts about “Application >>>>> Mode”. >>>>> >>>> >>>>> >>>> >>>>> >>>> 1. The application mode could cover the per-job sematic. Its >>>>> lifecyle is bound >>>>> >>>> to the user `main()`. And all the jobs in the user main will be >>>>> executed in a same >>>>> >>>> Flink cluster. In first phase of FLIP-85 implementation, running >>>>> user main on the >>>>> >>>> cluster side could be supported in application mode. >>>>> >>>> >>>>> >>>> 2. Maybe in the future, we also need to support multiple >>>>> `execute()` on client side >>>>> >>>> in a same Flink cluster. Then the per-job mode will evolve to >>>>> application mode. >>>>> >>>> >>>>> >>>> 3. From user perspective, only a `-R/-- remote-deploy` cli option >>>>> is visible. They >>>>> >>>> are not aware of the application mode. >>>>> >>>> >>>>> >>>> 4. In the first phase, the application mode is working as >>>>> “per-job”(only one job in >>>>> >>>> the user main). We just leave more potential for the future. >>>>> >>>> >>>>> >>>> >>>>> >>>> I am not against with calling it “cluster deploy mode” if you all >>>>> think it is clearer for users. >>>>> >>>> >>>>> >>>> >>>>> >>>> >>>>> >>>> Best, >>>>> >>>> Yang >>>>> >>>> >>>>> >>>> Kostas Kloudas <[hidden email]> 于2020年3月3日周二 下午6:49写道: >>>>> >>>>> >>>>> >>>>> Hi Peter, >>>>> >>>>> >>>>> >>>>> I understand your point. This is why I was also a bit torn about >>>>> the >>>>> >>>>> name and my proposal was a bit aligned with yours (something >>>>> along the >>>>> >>>>> lines of "cluster deploy" mode). >>>>> >>>>> >>>>> >>>>> But many of the other participants in the discussion suggested >>>>> the >>>>> >>>>> "Application Mode". I think that the reasoning is that now the >>>>> user's >>>>> >>>>> Application is more self-contained. >>>>> >>>>> It will be submitted to the cluster and the user can just >>>>> disconnect. >>>>> >>>>> In addition, as discussed briefly in the doc, in the future >>>>> there may >>>>> >>>>> be better support for multi-execute applications which will >>>>> bring us >>>>> >>>>> one step closer to the true "Application Mode". 
But this is how I >>>>> >>>>> interpreted their arguments, of course they can also express >>>>> their >>>>> >>>>> thoughts on the topic :) >>>>> >>>>> >>>>> >>>>> Cheers, >>>>> >>>>> Kostas >>>>> >>>>> >>>>> >>>>> On Mon, Mar 2, 2020 at 6:15 PM Peter Huang < >>>>> [hidden email]> wrote: >>>>> >>>>> > >>>>> >>>>> > Hi Kostas, >>>>> >>>>> > >>>>> >>>>> > Thanks for updating the wiki. We have aligned with the >>>>> implementations in the doc. But I feel it is still a little bit confusing >>>>> of the naming from a user's perspective. It is well known that Flink >>>>> support per job cluster and session cluster. The concept is in the layer of >>>>> how a job is managed within Flink. The method introduced util now is a kind >>>>> of mixing job and session cluster to promising the implementation >>>>> complexity. We probably don't need to label it as Application Model as the >>>>> same layer of per job cluster and session cluster. Conceptually, I think it >>>>> is still a cluster mode implementation for per job cluster. >>>>> >>>>> > >>>>> >>>>> > To minimize the confusion of users, I think it would be better >>>>> just an option of per job cluster for each type of cluster manager. How do >>>>> you think? >>>>> >>>>> > >>>>> >>>>> > >>>>> >>>>> > Best Regards >>>>> >>>>> > Peter Huang >>>>> >>>>> > >>>>> >>>>> > >>>>> >>>>> > >>>>> >>>>> > >>>>> >>>>> > >>>>> >>>>> > >>>>> >>>>> > >>>>> >>>>> > >>>>> >>>>> > On Mon, Mar 2, 2020 at 7:22 AM Kostas Kloudas < >>>>> [hidden email]> wrote: >>>>> >>>>> >> >>>>> >>>>> >> Hi Yang, >>>>> >>>>> >> >>>>> >>>>> >> The difference between per-job and application mode is that, >>>>> as you >>>>> >>>>> >> described, in the per-job mode the main is executed on the >>>>> client >>>>> >>>>> >> while in the application mode, the main is executed on the >>>>> cluster. >>>>> >>>>> >> I do not think we have to offer "application mode" with >>>>> running the >>>>> >>>>> >> main on the client side as this is exactly what the per-job >>>>> mode does >>>>> >>>>> >> currently and, as you described also, it would be redundant. >>>>> >>>>> >> >>>>> >>>>> >> Sorry if this was not clear in the document. >>>>> >>>>> >> >>>>> >>>>> >> Cheers, >>>>> >>>>> >> Kostas >>>>> >>>>> >> >>>>> >>>>> >> On Mon, Mar 2, 2020 at 3:17 PM Yang Wang < >>>>> [hidden email]> wrote: >>>>> >>>>> >> > >>>>> >>>>> >> > Hi Kostas, >>>>> >>>>> >> > >>>>> >>>>> >> > Thanks a lot for your conclusion and updating the FLIP-85 >>>>> WIKI. Currently, i have no more >>>>> >>>>> >> > questions about motivation, approach, fault tolerance and >>>>> the first phase implementation. >>>>> >>>>> >> > >>>>> >>>>> >> > I think the new title "Flink Application Mode" makes a lot >>>>> senses to me. Especially for the >>>>> >>>>> >> > containerized environment, the cluster deploy option will >>>>> be very useful. >>>>> >>>>> >> > >>>>> >>>>> >> > Just one concern, how do we introduce this new application >>>>> mode to our users? >>>>> >>>>> >> > Each user program(i.e. `main()`) is an application. >>>>> Currently, we intend to only support one >>>>> >>>>> >> > `execute()`. So what's the difference between per-job and >>>>> application mode? >>>>> >>>>> >> > >>>>> >>>>> >> > For per-job, user `main()` is always executed on client >>>>> side. And For application mode, user >>>>> >>>>> >> > `main()` could be executed on client or master >>>>> side(configured via cli option). >>>>> >>>>> >> > Right? We need to have a clear concept. Otherwise, the >>>>> users will be more and more confusing. 
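To make the distinction concrete, "user `main()` executed on master side" essentially means the ClusterEntrypoint loads the user's entry class with the user-code class loader and invokes it there, so that `execute()` submits against the dispatcher of that same cluster. The following is a deliberately simplified reflection sketch with invented names, not the actual entrypoint code:

    import java.lang.reflect.Method;

    /** Illustration only: what "running the user main() on the master" boils down to. */
    public final class UserMainRunner {

        public static void runUserMain(
                ClassLoader userCodeClassLoader,
                String entryPointClassName,
                String[] programArguments) throws Exception {

            final Class<?> entryClass =
                    Class.forName(entryPointClassName, true, userCodeClassLoader);
            final Method main = entryClass.getMethod("main", String[].class);

            // env.execute() inside the user code would then submit against the
            // dispatcher of this very cluster instead of deploying a new one.
            main.invoke(null, (Object) programArguments);
        }

        private UserMainRunner() {}
    }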
>>>>> >>>>> >> > >>>>> >>>>> >> > >>>>> >>>>> >> > Best, >>>>> >>>>> >> > Yang >>>>> >>>>> >> > >>>>> >>>>> >> > Kostas Kloudas <[hidden email]> 于2020年3月2日周一 下午5:58写道: >>>>> >>>>> >> >> >>>>> >>>>> >> >> Hi all, >>>>> >>>>> >> >> >>>>> >>>>> >> >> I update >>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-85+Flink+Application+Mode >>>>> >>>>> >> >> based on the discussion we had here: >>>>> >>>>> >> >> >>>>> >>>>> >> >> >>>>> https://docs.google.com/document/d/1ji72s3FD9DYUyGuKnJoO4ApzV-nSsZa0-bceGXW7Ocw/edit# >>>>> >>>>> >> >> >>>>> >>>>> >> >> Please let me know what you think and please keep the >>>>> discussion in the ML :) >>>>> >>>>> >> >> >>>>> >>>>> >> >> Thanks for starting the discussion and I hope that soon we >>>>> will be >>>>> >>>>> >> >> able to vote on the FLIP. >>>>> >>>>> >> >> >>>>> >>>>> >> >> Cheers, >>>>> >>>>> >> >> Kostas >>>>> >>>>> >> >> >>>>> >>>>> >> >> On Thu, Jan 16, 2020 at 3:40 AM Yang Wang < >>>>> [hidden email]> wrote: >>>>> >>>>> >> >> > >>>>> >>>>> >> >> > Hi all, >>>>> >>>>> >> >> > >>>>> >>>>> >> >> > Thanks a lot for the feedback from @Kostas Kloudas. Your >>>>> all concerns are >>>>> >>>>> >> >> > on point. The FLIP-85 is mainly >>>>> >>>>> >> >> > focused on supporting cluster mode for per-job. Since it >>>>> is more urgent and >>>>> >>>>> >> >> > have much more use >>>>> >>>>> >> >> > cases both in Yarn and Kubernetes deployment. For >>>>> session cluster, we could >>>>> >>>>> >> >> > have more discussion >>>>> >>>>> >> >> > in a new thread later. >>>>> >>>>> >> >> > >>>>> >>>>> >> >> > #1, How to download the user jars and dependencies for >>>>> per-job in cluster >>>>> >>>>> >> >> > mode? >>>>> >>>>> >> >> > For Yarn, we could register the user jars and >>>>> dependencies as >>>>> >>>>> >> >> > LocalResource. They will be distributed >>>>> >>>>> >> >> > by Yarn. And once the JobManager and TaskManager >>>>> launched, the jars are >>>>> >>>>> >> >> > already exists. >>>>> >>>>> >> >> > For Standalone per-job and K8s, we expect that the user >>>>> jars >>>>> >>>>> >> >> > and dependencies are built into the image. >>>>> >>>>> >> >> > Or the InitContainer could be used for downloading. It >>>>> is natively >>>>> >>>>> >> >> > distributed and we will not have bottleneck. >>>>> >>>>> >> >> > >>>>> >>>>> >> >> > #2, Job graph recovery >>>>> >>>>> >> >> > We could have an optimization to store job graph on the >>>>> DFS. However, i >>>>> >>>>> >> >> > suggest building a new jobgraph >>>>> >>>>> >> >> > from the configuration is the default option. Since we >>>>> will not always have >>>>> >>>>> >> >> > a DFS store when deploying a >>>>> >>>>> >> >> > Flink per-job cluster. Of course, we assume that using >>>>> the same >>>>> >>>>> >> >> > configuration(e.g. job_id, user_jar, main_class, >>>>> >>>>> >> >> > main_args, parallelism, savepoint_settings, etc.) will >>>>> get a same job >>>>> >>>>> >> >> > graph. I think the standalone per-job >>>>> >>>>> >> >> > already has the similar behavior. >>>>> >>>>> >> >> > >>>>> >>>>> >> >> > #3, What happens with jobs that have multiple execute >>>>> calls? >>>>> >>>>> >> >> > Currently, it is really a problem. Even we use a local >>>>> client on Flink >>>>> >>>>> >> >> > master side, it will have different behavior with >>>>> >>>>> >> >> > client mode. For client mode, if we execute multiple >>>>> times, then we will >>>>> >>>>> >> >> > deploy multiple Flink clusters for each execute. >>>>> >>>>> >> >> > I am not pretty sure whether it is reasonable. 
However, >>>>> i still think using >>>>> >>>>> >> >> > the local client is a good choice. We could >>>>> >>>>> >> >> > continue the discussion in a new thread. @Zili Chen < >>>>> [hidden email]> Do >>>>> >>>>> >> >> > you want to drive this? >>>>> >>>>> >> >> > >>>>> >>>>> >> >> > >>>>> >>>>> >> >> > >>>>> >>>>> >> >> > Best, >>>>> >>>>> >> >> > Yang >>>>> >>>>> >> >> > >>>>> >>>>> >> >> > Peter Huang <[hidden email]> 于2020年1月16日周四 >>>>> 上午1:55写道: >>>>> >>>>> >> >> > >>>>> >>>>> >> >> > > Hi Kostas, >>>>> >>>>> >> >> > > >>>>> >>>>> >> >> > > Thanks for this feedback. I can't agree more about the >>>>> opinion. The >>>>> >>>>> >> >> > > cluster mode should be added >>>>> >>>>> >> >> > > first in per job cluster. >>>>> >>>>> >> >> > > >>>>> >>>>> >> >> > > 1) For job cluster implementation >>>>> >>>>> >> >> > > 1. Job graph recovery from configuration or store as >>>>> static job graph as >>>>> >>>>> >> >> > > session cluster. I think the static one will be better >>>>> for less recovery >>>>> >>>>> >> >> > > time. >>>>> >>>>> >> >> > > Let me update the doc for details. >>>>> >>>>> >> >> > > >>>>> >>>>> >> >> > > 2. For job execute multiple times, I think @Zili Chen >>>>> >>>>> >> >> > > <[hidden email]> has proposed the local client >>>>> solution that can >>>>> >>>>> >> >> > > the run program actually in the cluster entry point. >>>>> We can put the >>>>> >>>>> >> >> > > implementation in the second stage, >>>>> >>>>> >> >> > > or even a new FLIP for further discussion. >>>>> >>>>> >> >> > > >>>>> >>>>> >> >> > > 2) For session cluster implementation >>>>> >>>>> >> >> > > We can disable the cluster mode for the session >>>>> cluster in the first >>>>> >>>>> >> >> > > stage. I agree the jar downloading will be a painful >>>>> thing. >>>>> >>>>> >> >> > > We can consider about PoC and performance evaluation >>>>> first. If the end to >>>>> >>>>> >> >> > > end experience is good enough, then we can consider >>>>> >>>>> >> >> > > proceeding with the solution. >>>>> >>>>> >> >> > > >>>>> >>>>> >> >> > > Looking forward to more opinions from @Yang Wang < >>>>> [hidden email]> @Zili >>>>> >>>>> >> >> > > Chen <[hidden email]> @Dian Fu < >>>>> [hidden email]>. >>>>> >>>>> >> >> > > >>>>> >>>>> >> >> > > >>>>> >>>>> >> >> > > Best Regards >>>>> >>>>> >> >> > > Peter Huang >>>>> >>>>> >> >> > > >>>>> >>>>> >> >> > > On Wed, Jan 15, 2020 at 7:50 AM Kostas Kloudas < >>>>> [hidden email]> wrote: >>>>> >>>>> >> >> > > >>>>> >>>>> >> >> > >> Hi all, >>>>> >>>>> >> >> > >> >>>>> >>>>> >> >> > >> I am writing here as the discussion on the Google Doc >>>>> seems to be a >>>>> >>>>> >> >> > >> bit difficult to follow. >>>>> >>>>> >> >> > >> >>>>> >>>>> >> >> > >> I think that in order to be able to make progress, it >>>>> would be helpful >>>>> >>>>> >> >> > >> to focus on per-job mode for now. >>>>> >>>>> >> >> > >> The reason is that: >>>>> >>>>> >> >> > >> 1) making the (unique) JobSubmitHandler responsible >>>>> for creating the >>>>> >>>>> >> >> > >> jobgraphs, >>>>> >>>>> >> >> > >> which includes downloading dependencies, is not an >>>>> optimal solution >>>>> >>>>> >> >> > >> 2) even if we put the responsibility on the >>>>> JobMaster, currently each >>>>> >>>>> >> >> > >> job has its own >>>>> >>>>> >> >> > >> JobMaster but they all run on the same process, so >>>>> we have again a >>>>> >>>>> >> >> > >> single entity. 
>>>>> >>>>> >> >> > >> >>>>> >>>>> >> >> > >> Of course after this is done, and if we feel >>>>> comfortable with the >>>>> >>>>> >> >> > >> solution, then we can go to the session mode. >>>>> >>>>> >> >> > >> >>>>> >>>>> >> >> > >> A second comment has to do with fault-tolerance in >>>>> the per-job, >>>>> >>>>> >> >> > >> cluster-deploy mode. >>>>> >>>>> >> >> > >> In the document, it is suggested that upon recovery, >>>>> the JobMaster of >>>>> >>>>> >> >> > >> each job re-creates the JobGraph. >>>>> >>>>> >> >> > >> I am just wondering if it is better to create and >>>>> store the jobGraph >>>>> >>>>> >> >> > >> upon submission and only fetch it >>>>> >>>>> >> >> > >> upon recovery so that we have a static jobGraph. >>>>> >>>>> >> >> > >> >>>>> >>>>> >> >> > >> Finally, I have a question which is what happens with >>>>> jobs that have >>>>> >>>>> >> >> > >> multiple execute calls? >>>>> >>>>> >> >> > >> The semantics seem to change compared to the current >>>>> behaviour, right? >>>>> >>>>> >> >> > >> >>>>> >>>>> >> >> > >> Cheers, >>>>> >>>>> >> >> > >> Kostas >>>>> >>>>> >> >> > >> >>>>> >>>>> >> >> > >> On Wed, Jan 8, 2020 at 8:05 PM tison < >>>>> [hidden email]> wrote: >>>>> >>>>> >> >> > >> > >>>>> >>>>> >> >> > >> > not always, Yang Wang is also not yet a committer >>>>> but he can join the >>>>> >>>>> >> >> > >> > channel. I cannot find the id by clicking “Add new >>>>> member in channel” so >>>>> >>>>> >> >> > >> > come to you and ask for try out the link. Possibly >>>>> I will find other >>>>> >>>>> >> >> > >> ways >>>>> >>>>> >> >> > >> > but the original purpose is that the slack channel >>>>> is a public area we >>>>> >>>>> >> >> > >> > discuss about developing... >>>>> >>>>> >> >> > >> > Best, >>>>> >>>>> >> >> > >> > tison. >>>>> >>>>> >> >> > >> > >>>>> >>>>> >> >> > >> > >>>>> >>>>> >> >> > >> > Peter Huang <[hidden email]> >>>>> 于2020年1月9日周四 上午2:44写道: >>>>> >>>>> >> >> > >> > >>>>> >>>>> >> >> > >> > > Hi Tison, >>>>> >>>>> >> >> > >> > > >>>>> >>>>> >> >> > >> > > I am not the committer of Flink yet. I think I >>>>> can't join it also. >>>>> >>>>> >> >> > >> > > >>>>> >>>>> >> >> > >> > > >>>>> >>>>> >> >> > >> > > Best Regards >>>>> >>>>> >> >> > >> > > Peter Huang >>>>> >>>>> >> >> > >> > > >>>>> >>>>> >> >> > >> > > On Wed, Jan 8, 2020 at 9:39 AM tison < >>>>> [hidden email]> wrote: >>>>> >>>>> >> >> > >> > > >>>>> >>>>> >> >> > >> > > > Hi Peter, >>>>> >>>>> >> >> > >> > > > >>>>> >>>>> >> >> > >> > > > Could you try out this link? >>>>> >>>>> >> >> > >> > > https://the-asf.slack.com/messages/CNA3ADZPH >>>>> >>>>> >> >> > >> > > > >>>>> >>>>> >> >> > >> > > > Best, >>>>> >>>>> >> >> > >> > > > tison. >>>>> >>>>> >> >> > >> > > > >>>>> >>>>> >> >> > >> > > > >>>>> >>>>> >> >> > >> > > > Peter Huang <[hidden email]> >>>>> 于2020年1月9日周四 上午1:22写道: >>>>> >>>>> >> >> > >> > > > >>>>> >>>>> >> >> > >> > > > > Hi Tison, >>>>> >>>>> >> >> > >> > > > > >>>>> >>>>> >> >> > >> > > > > I can't join the group with shared link. >>>>> Would you please add me >>>>> >>>>> >> >> > >> into >>>>> >>>>> >> >> > >> > > the >>>>> >>>>> >> >> > >> > > > > group? My slack account is huangzhenqiu0825. >>>>> >>>>> >> >> > >> > > > > Thank you in advance. 
>>>>> >>>>> >> >> > >> > > > > >>>>> >>>>> >> >> > >> > > > > >>>>> >>>>> >> >> > >> > > > > Best Regards >>>>> >>>>> >> >> > >> > > > > Peter Huang >>>>> >>>>> >> >> > >> > > > > >>>>> >>>>> >> >> > >> > > > > On Wed, Jan 8, 2020 at 12:02 AM tison < >>>>> [hidden email]> >>>>> >>>>> >> >> > >> wrote: >>>>> >>>>> >> >> > >> > > > > >>>>> >>>>> >> >> > >> > > > > > Hi Peter, >>>>> >>>>> >> >> > >> > > > > > >>>>> >>>>> >> >> > >> > > > > > As described above, this effort should get >>>>> attention from people >>>>> >>>>> >> >> > >> > > > > developing >>>>> >>>>> >> >> > >> > > > > > FLIP-73 a.k.a. Executor abstractions. I >>>>> recommend you to join >>>>> >>>>> >> >> > >> the >>>>> >>>>> >> >> > >> > > > public >>>>> >>>>> >> >> > >> > > > > > slack channel[1] for Flink Client API >>>>> Enhancement and you can >>>>> >>>>> >> >> > >> try to >>>>> >>>>> >> >> > >> > > > > share >>>>> >>>>> >> >> > >> > > > > > you detailed thoughts there. It possibly >>>>> gets more concrete >>>>> >>>>> >> >> > >> > > attentions. >>>>> >>>>> >> >> > >> > > > > > >>>>> >>>>> >> >> > >> > > > > > Best, >>>>> >>>>> >> >> > >> > > > > > tison. >>>>> >>>>> >> >> > >> > > > > > >>>>> >>>>> >> >> > >> > > > > > [1] >>>>> >>>>> >> >> > >> > > > > > >>>>> >>>>> >> >> > >> > > > > > >>>>> >>>>> >> >> > >> > > > > >>>>> >>>>> >> >> > >> > > > >>>>> >>>>> >> >> > >> > > >>>>> >>>>> >> >> > >> >>>>> https://slack.com/share/IS21SJ75H/Rk8HhUly9FuEHb7oGwBZ33uL/enQtODg2MDYwNjE5MTg3LTA2MjIzNDc1M2ZjZDVlMjdlZjk1M2RkYmJhNjAwMTk2ZDZkODQ4NmY5YmI4OGRhNWJkYTViMTM1NzlmMzc4OWM >>>>> >>>>> >> >> > >> > > > > > >>>>> >>>>> >> >> > >> > > > > > >>>>> >>>>> >> >> > >> > > > > > Peter Huang <[hidden email]> >>>>> 于2020年1月7日周二 上午5:09写道: >>>>> >>>>> >> >> > >> > > > > > >>>>> >>>>> >> >> > >> > > > > > > Dear All, >>>>> >>>>> >> >> > >> > > > > > > >>>>> >>>>> >> >> > >> > > > > > > Happy new year! According to existing >>>>> feedback from the >>>>> >>>>> >> >> > >> community, >>>>> >>>>> >> >> > >> > > we >>>>> >>>>> >> >> > >> > > > > > > revised the doc with the consideration of >>>>> session cluster >>>>> >>>>> >> >> > >> support, >>>>> >>>>> >> >> > >> > > > and >>>>> >>>>> >> >> > >> > > > > > > concrete interface changes needed and >>>>> execution plan. Please >>>>> >>>>> >> >> > >> take >>>>> >>>>> >> >> > >> > > one >>>>> >>>>> >> >> > >> > > > > > more >>>>> >>>>> >> >> > >> > > > > > > round of review at your most convenient >>>>> time. >>>>> >>>>> >> >> > >> > > > > > > >>>>> >>>>> >> >> > >> > > > > > > >>>>> >>>>> >> >> > >> > > > > > > >>>>> >>>>> >> >> > >> > > > > > >>>>> >>>>> >> >> > >> > > > > >>>>> >>>>> >> >> > >> > > > >>>>> >>>>> >> >> > >> > > >>>>> >>>>> >> >> > >> >>>>> https://docs.google.com/document/d/1aAwVjdZByA-0CHbgv16Me-vjaaDMCfhX7TzVVTuifYM/edit# >>>>> >>>>> >> >> > >> > > > > > > >>>>> >>>>> >> >> > >> > > > > > > >>>>> >>>>> >> >> > >> > > > > > > Best Regards >>>>> >>>>> >> >> > >> > > > > > > Peter Huang >>>>> >>>>> >> >> > >> > > > > > > >>>>> >>>>> >> >> > >> > > > > > > >>>>> >>>>> >> >> > >> > > > > > > >>>>> >>>>> >> >> > >> > > > > > > >>>>> >>>>> >> >> > >> > > > > > > >>>>> >>>>> >> >> > >> > > > > > > On Thu, Jan 2, 2020 at 11:29 AM Peter >>>>> Huang < >>>>> >>>>> >> >> > >> > > > > [hidden email]> >>>>> >>>>> >> >> > >> > > > > > > wrote: >>>>> >>>>> >> >> > >> > > > > > > >>>>> >>>>> >> >> > >> > > > > > > > Hi Dian, >>>>> >>>>> >> >> > >> > > > > > > > Thanks for giving us valuable feedbacks. 
>>>>> >>>>> >> >> > >> > > > > > > > >>>>> >>>>> >> >> > >> > > > > > > > 1) It's better to have a whole design >>>>> for this feature >>>>> >>>>> >> >> > >> > > > > > > > For the suggestion of enabling the >>>>> cluster mode also session >>>>> >>>>> >> >> > >> > > > > cluster, I >>>>> >>>>> >> >> > >> > > > > > > > think Flink already supported it. >>>>> WebSubmissionExtension >>>>> >>>>> >> >> > >> already >>>>> >>>>> >> >> > >> > > > > allows >>>>> >>>>> >> >> > >> > > > > > > > users to start a job with the specified >>>>> jar by using web UI. >>>>> >>>>> >> >> > >> > > > > > > > But we need to enable the feature from >>>>> CLI for both local >>>>> >>>>> >> >> > >> jar, >>>>> >>>>> >> >> > >> > > > remote >>>>> >>>>> >> >> > >> > > > > > > jar. >>>>> >>>>> >> >> > >> > > > > > > > I will align with Yang Wang first about >>>>> the details and >>>>> >>>>> >> >> > >> update >>>>> >>>>> >> >> > >> > > the >>>>> >>>>> >> >> > >> > > > > > design >>>>> >>>>> >> >> > >> > > > > > > > doc. >>>>> >>>>> >> >> > >> > > > > > > > >>>>> >>>>> >> >> > >> > > > > > > > 2) It's better to consider the >>>>> convenience for users, such >>>>> >>>>> >> >> > >> as >>>>> >>>>> >> >> > >> > > > > debugging >>>>> >>>>> >> >> > >> > > > > > > > >>>>> >>>>> >> >> > >> > > > > > > > I am wondering whether we can store the >>>>> exception in >>>>> >>>>> >> >> > >> jobgragh >>>>> >>>>> >> >> > >> > > > > > > > generation in application master. As no >>>>> streaming graph can >>>>> >>>>> >> >> > >> be >>>>> >>>>> >> >> > >> > > > > > scheduled >>>>> >>>>> >> >> > >> > > > > > > in >>>>> >>>>> >> >> > >> > > > > > > > this case, there will be no more TM >>>>> will be requested from >>>>> >>>>> >> >> > >> > > FlinkRM. >>>>> >>>>> >> >> > >> > > > > > > > If the AM is still running, users can >>>>> still query it from >>>>> >>>>> >> >> > >> CLI. As >>>>> >>>>> >> >> > >> > > > it >>>>> >>>>> >> >> > >> > > > > > > > requires more change, we can get some >>>>> feedback from < >>>>> >>>>> >> >> > >> > > > > > [hidden email] >>>>> >>>>> >> >> > >> > > > > > > > >>>>> >>>>> >> >> > >> > > > > > > > and @[hidden email] <[hidden email] >>>>> >. >>>>> >>>>> >> >> > >> > > > > > > > >>>>> >>>>> >> >> > >> > > > > > > > 3) It's better to consider the impact >>>>> to the stability of >>>>> >>>>> >> >> > >> the >>>>> >>>>> >> >> > >> > > > cluster >>>>> >>>>> >> >> > >> > > > > > > > >>>>> >>>>> >> >> > >> > > > > > > > I agree with Yang Wang's opinion. >>>>> >>>>> >> >> > >> > > > > > > > >>>>> >>>>> >> >> > >> > > > > > > > >>>>> >>>>> >> >> > >> > > > > > > > >>>>> >>>>> >> >> > >> > > > > > > > Best Regards >>>>> >>>>> >> >> > >> > > > > > > > Peter Huang >>>>> >>>>> >> >> > >> > > > > > > > >>>>> >>>>> >> >> > >> > > > > > > > >>>>> >>>>> >> >> > >> > > > > > > > On Sun, Dec 29, 2019 at 9:44 PM Dian Fu >>>>> < >>>>> >>>>> >> >> > >> [hidden email]> >>>>> >>>>> >> >> > >> > > > > wrote: >>>>> >>>>> >> >> > >> > > > > > > > >>>>> >>>>> >> >> > >> > > > > > > >> Hi all, >>>>> >>>>> >> >> > >> > > > > > > >> >>>>> >>>>> >> >> > >> > > > > > > >> Sorry to jump into this discussion. >>>>> Thanks everyone for the >>>>> >>>>> >> >> > >> > > > > > discussion. >>>>> >>>>> >> >> > >> > > > > > > >> I'm very interested in this topic >>>>> although I'm not an >>>>> >>>>> >> >> > >> expert in >>>>> >>>>> >> >> > >> > > > this >>>>> >>>>> >> >> > >> > > > > > > part. 
>>>>> >>>>> >> >> > >> > > > > > > >> So I'm glad to share my thoughts as >>>>> following: >>>>> >>>>> >> >> > >> > > > > > > >> >>>>> >>>>> >> >> > >> > > > > > > >> 1) It's better to have a whole design >>>>> for this feature >>>>> >>>>> >> >> > >> > > > > > > >> As we know, there are two deployment >>>>> modes: per-job mode >>>>> >>>>> >> >> > >> and >>>>> >>>>> >> >> > >> > > > session >>>>> >>>>> >> >> > >> > > > > > > >> mode. I'm wondering which mode really >>>>> needs this feature. >>>>> >>>>> >> >> > >> As the >>>>> >>>>> >> >> > >> > > > > > design >>>>> >>>>> >> >> > >> > > > > > > doc >>>>> >>>>> >> >> > >> > > > > > > >> mentioned, per-job mode is more used >>>>> for streaming jobs and >>>>> >>>>> >> >> > >> > > > session >>>>> >>>>> >> >> > >> > > > > > > mode is >>>>> >>>>> >> >> > >> > > > > > > >> usually used for batch jobs(Of course, >>>>> the job types and >>>>> >>>>> >> >> > >> the >>>>> >>>>> >> >> > >> > > > > > deployment >>>>> >>>>> >> >> > >> > > > > > > >> modes are orthogonal). Usually >>>>> streaming job is only >>>>> >>>>> >> >> > >> needed to >>>>> >>>>> >> >> > >> > > be >>>>> >>>>> >> >> > >> > > > > > > submitted >>>>> >>>>> >> >> > >> > > > > > > >> once and it will run for days or >>>>> weeks, while batch jobs >>>>> >>>>> >> >> > >> will be >>>>> >>>>> >> >> > >> > > > > > > submitted >>>>> >>>>> >> >> > >> > > > > > > >> more frequently compared with >>>>> streaming jobs. This means >>>>> >>>>> >> >> > >> that >>>>> >>>>> >> >> > >> > > > maybe >>>>> >>>>> >> >> > >> > > > > > > session >>>>> >>>>> >> >> > >> > > > > > > >> mode also needs this feature. However, >>>>> if we support this >>>>> >>>>> >> >> > >> > > feature >>>>> >>>>> >> >> > >> > > > in >>>>> >>>>> >> >> > >> > > > > > > >> session mode, the application master >>>>> will become the new >>>>> >>>>> >> >> > >> > > > centralized >>>>> >>>>> >> >> > >> > > > > > > >> service(which should be solved). So in >>>>> this case, it's >>>>> >>>>> >> >> > >> better to >>>>> >>>>> >> >> > >> > > > > have >>>>> >>>>> >> >> > >> > > > > > a >>>>> >>>>> >> >> > >> > > > > > > >> complete design for both per-job mode >>>>> and session mode. >>>>> >>>>> >> >> > >> > > > Furthermore, >>>>> >>>>> >> >> > >> > > > > > > even >>>>> >>>>> >> >> > >> > > > > > > >> if we can do it phase by phase, we >>>>> need to have a whole >>>>> >>>>> >> >> > >> picture >>>>> >>>>> >> >> > >> > > of >>>>> >>>>> >> >> > >> > > > > how >>>>> >>>>> >> >> > >> > > > > > > it >>>>> >>>>> >> >> > >> > > > > > > >> works in both per-job mode and session >>>>> mode. 
>>>>> >>>>> >> >> > >> > > > > > > >> >>>>> >>>>> >> >> > >> > > > > > > >> 2) It's better to consider the >>>>> convenience for users, such >>>>> >>>>> >> >> > >> as >>>>> >>>>> >> >> > >> > > > > > debugging >>>>> >>>>> >> >> > >> > > > > > > >> After we finish this feature, the job >>>>> graph will be >>>>> >>>>> >> >> > >> compiled in >>>>> >>>>> >> >> > >> > > > the >>>>> >>>>> >> >> > >> > > > > > > >> application master, which means that >>>>> users cannot easily >>>>> >>>>> >> >> > >> get the >>>>> >>>>> >> >> > >> > > > > > > exception >>>>> >>>>> >> >> > >> > > > > > > >> message synchorousely in the job >>>>> client if there are >>>>> >>>>> >> >> > >> problems >>>>> >>>>> >> >> > >> > > > during >>>>> >>>>> >> >> > >> > > > > > the >>>>> >>>>> >> >> > >> > > > > > > >> job graph compiling (especially for >>>>> platform users), such >>>>> >>>>> >> >> > >> as the >>>>> >>>>> >> >> > >> > > > > > > resource >>>>> >>>>> >> >> > >> > > > > > > >> path is incorrect, the user program >>>>> itself has some >>>>> >>>>> >> >> > >> problems, >>>>> >>>>> >> >> > >> > > etc. >>>>> >>>>> >> >> > >> > > > > > What >>>>> >>>>> >> >> > >> > > > > > > I'm >>>>> >>>>> >> >> > >> > > > > > > >> thinking is that maybe we should throw >>>>> the exceptions as >>>>> >>>>> >> >> > >> early >>>>> >>>>> >> >> > >> > > as >>>>> >>>>> >> >> > >> > > > > > > possible >>>>> >>>>> >> >> > >> > > > > > > >> (during job submission stage). >>>>> >>>>> >> >> > >> > > > > > > >> >>>>> >>>>> >> >> > >> > > > > > > >> 3) It's better to consider the impact >>>>> to the stability of >>>>> >>>>> >> >> > >> the >>>>> >>>>> >> >> > >> > > > > cluster >>>>> >>>>> >> >> > >> > > > > > > >> If we perform the compiling in the >>>>> application master, we >>>>> >>>>> >> >> > >> should >>>>> >>>>> >> >> > >> > > > > > > consider >>>>> >>>>> >> >> > >> > > > > > > >> the impact of the compiling errors. >>>>> Although YARN could >>>>> >>>>> >> >> > >> resume >>>>> >>>>> >> >> > >> > > the >>>>> >>>>> >> >> > >> > > > > > > >> application master in case of >>>>> failures, but in some case >>>>> >>>>> >> >> > >> the >>>>> >>>>> >> >> > >> > > > > compiling >>>>> >>>>> >> >> > >> > > > > > > >> failure may be a waste of cluster >>>>> resource and may impact >>>>> >>>>> >> >> > >> the >>>>> >>>>> >> >> > >> > > > > > stability >>>>> >>>>> >> >> > >> > > > > > > the >>>>> >>>>> >> >> > >> > > > > > > >> cluster and the other jobs in the >>>>> cluster, such as the >>>>> >>>>> >> >> > >> resource >>>>> >>>>> >> >> > >> > > > path >>>>> >>>>> >> >> > >> > > > > > is >>>>> >>>>> >> >> > >> > > > > > > >> incorrect, the user program itself has >>>>> some problems(in >>>>> >>>>> >> >> > >> this >>>>> >>>>> >> >> > >> > > case, >>>>> >>>>> >> >> > >> > > > > job >>>>> >>>>> >> >> > >> > > > > > > >> failover cannot solve this kind of >>>>> problems) etc. In the >>>>> >>>>> >> >> > >> current >>>>> >>>>> >> >> > >> > > > > > > >> implemention, the compiling errors are >>>>> handled in the >>>>> >>>>> >> >> > >> client >>>>> >>>>> >> >> > >> > > side >>>>> >>>>> >> >> > >> > > > > and >>>>> >>>>> >> >> > >> > > > > > > there >>>>> >>>>> >> >> > >> > > > > > > >> is no impact to the cluster at all. >>>>> >>>>> >> >> > >> > > > > > > >> >>>>> >>>>> >> >> > >> > > > > > > >> Regarding to 1), it's clearly pointed >>>>> in the design doc >>>>> >>>>> >> >> > >> that >>>>> >>>>> >> >> > >> > > only >>>>> >>>>> >> >> > >> > > > > > > per-job >>>>> >>>>> >> >> > >> > > > > > > >> mode will be supported. 
However, I >>>>> think it's better to >>>>> >>>>> >> >> > >> also >>>>> >>>>> >> >> > >> > > > > consider >>>>> >>>>> >> >> > >> > > > > > > the >>>>> >>>>> >> >> > >> > > > > > > >> session mode in the design doc. >>>>> >>>>> >> >> > >> > > > > > > >> Regarding to 2) and 3), I have not >>>>> seen related sections >>>>> >>>>> >> >> > >> in the >>>>> >>>>> >> >> > >> > > > > design >>>>> >>>>> >> >> > >> > > > > > > >> doc. It will be good if we can cover >>>>> them in the design >>>>> >>>>> >> >> > >> doc. >>>>> >>>>> >> >> > >> > > > > > > >> >>>>> >>>>> >> >> > >> > > > > > > >> Feel free to correct me If there is >>>>> anything I >>>>> >>>>> >> >> > >> misunderstand. >>>>> >>>>> >> >> > >> > > > > > > >> >>>>> >>>>> >> >> > >> > > > > > > >> Regards, >>>>> >>>>> >> >> > >> > > > > > > >> Dian >>>>> >>>>> >> >> > >> > > > > > > >> >>>>> >>>>> >> >> > >> > > > > > > >> >>>>> >>>>> >> >> > >> > > > > > > >> > 在 2019年12月27日,上午3:13,Peter Huang < >>>>> >>>>> >> >> > >> [hidden email]> >>>>> >>>>> >> >> > >> > > > 写道: >>>>> >>>>> >> >> > >> > > > > > > >> > >>>>> >>>>> >> >> > >> > > > > > > >> > Hi Yang, >>>>> >>>>> >> >> > >> > > > > > > >> > >>>>> >>>>> >> >> > >> > > > > > > >> > I can't agree more. The effort >>>>> definitely needs to align >>>>> >>>>> >> >> > >> with >>>>> >>>>> >> >> > >> > > > the >>>>> >>>>> >> >> > >> > > > > > > final >>>>> >>>>> >> >> > >> > > > > > > >> > goal of FLIP-73. >>>>> >>>>> >> >> > >> > > > > > > >> > I am thinking about whether we can >>>>> achieve the goal with >>>>> >>>>> >> >> > >> two >>>>> >>>>> >> >> > >> > > > > phases. >>>>> >>>>> >> >> > >> > > > > > > >> > >>>>> >>>>> >> >> > >> > > > > > > >> > 1) Phase I >>>>> >>>>> >> >> > >> > > > > > > >> > As the CLiFrontend will not be >>>>> depreciated soon. We can >>>>> >>>>> >> >> > >> still >>>>> >>>>> >> >> > >> > > > use >>>>> >>>>> >> >> > >> > > > > > the >>>>> >>>>> >> >> > >> > > > > > > >> > deployMode flag there, >>>>> >>>>> >> >> > >> > > > > > > >> > pass the program info through Flink >>>>> configuration, use >>>>> >>>>> >> >> > >> the >>>>> >>>>> >> >> > >> > > > > > > >> > ClassPathJobGraphRetriever >>>>> >>>>> >> >> > >> > > > > > > >> > to generate the job graph in >>>>> ClusterEntrypoints of yarn >>>>> >>>>> >> >> > >> and >>>>> >>>>> >> >> > >> > > > > > > Kubernetes. >>>>> >>>>> >> >> > >> > > > > > > >> > >>>>> >>>>> >> >> > >> > > > > > > >> > 2) Phase II >>>>> >>>>> >> >> > >> > > > > > > >> > In AbstractJobClusterExecutor, the >>>>> job graph is >>>>> >>>>> >> >> > >> generated in >>>>> >>>>> >> >> > >> > > > the >>>>> >>>>> >> >> > >> > > > > > > >> execute >>>>> >>>>> >> >> > >> > > > > > > >> > function. We can still >>>>> >>>>> >> >> > >> > > > > > > >> > use the deployMode in it. With >>>>> deployMode = cluster, the >>>>> >>>>> >> >> > >> > > execute >>>>> >>>>> >> >> > >> > > > > > > >> function >>>>> >>>>> >> >> > >> > > > > > > >> > only starts the cluster. 
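A minimal sketch of the branching described above, assuming a hypothetical deployMode switch; the names (`DeployMode`, `deployClusterOnly`, `compileJobGraph`) are invented for illustration and this is not the real AbstractJobClusterExecutor code:

    /** Illustration only; the names are invented and this is not the real executor code. */
    final class JobClusterExecutorSketch {

        enum DeployMode { CLIENT, CLUSTER }

        void execute(Object pipeline, DeployMode deployMode) {
            if (deployMode == DeployMode.CLUSTER) {
                // cluster deploy mode: only bring up the cluster; the entrypoint on the
                // master compiles the JobGraph later from the shipped program metadata.
                deployClusterOnly(pipeline);
            } else {
                // client mode (today's behaviour): compile the JobGraph here and hand it
                // to the cluster descriptor together with the deployment.
                Object jobGraph = compileJobGraph(pipeline);
                deployClusterWith(jobGraph);
            }
        }

        private void deployClusterOnly(Object pipeline) { /* ship jar + program metadata */ }

        private Object compileJobGraph(Object pipeline) { return new Object(); }

        private void deployClusterWith(Object jobGraph) { /* existing per-job deployment */ }
    }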
When {Yarn/Kuberneates}PerJobClusterEntrypoint starts, It will start the dispatch first, then we can use a ClusterEnvironment similar to ContextEnvironment to submit the job with jobName to the local dispatcher. For the details, we need more investigation. Let's wait for @Aljoscha Krettek <[hidden email]> @Till Rohrmann <[hidden email]>'s feedback after the holiday season.

Thank you in advance. Merry Chrismas and Happy New Year!!!

Best Regards
Peter Huang

On Wed, Dec 25, 2019 at 1:08 AM Yang Wang <[hidden email]> wrote:

Hi Peter,

I think we need to reconsider tison's suggestion seriously. After FLIP-73, the deployJobCluster has been moved into `JobClusterExecutor#execute`. It should not be perceived for `CliFrontend`. That means the user program will *ALWAYS* be executed on client side. This is the by design behavior. So, we could not just add `if(client mode) .. else if(cluster mode) ...` codes in `CliFrontend` to bypass the executor. We need to find a clean way to decouple executing user program and deploying per-job cluster. Based on this, we could support to execute user program on client or master side.

Maybe Aljoscha and Jeff could give some good suggestions.

Best,
Yang

Peter Huang <[hidden email]> 于2019年12月25日周三 上午4:03写道:

Hi Jingjing,

The improvement proposed is a deployment option for CLI. For SQL based Flink application, It is more convenient to use the existing model in SqlClient in which the job graph is generated within SqlClient. After adding the delayed job graph generation, I think there is no change needed for your side.

Best Regards
Peter Huang

On Wed, Dec 18, 2019 at 6:01 AM jingjing bai <[hidden email]> wrote:

hi peter:
we had extension SqlClent to support sql job submit in web base on flink 1.9. we support submit to yarn on per job mode too. in this case, the job graph generated on client side. I think this discuss Mainly to improve api programme. but in my case, there is no jar to upload but only a sql string. do u had more suggestion to improve for sql mode or it is only a switch for api programme?

best
bai jj

Yang Wang <[hidden email]> 于2019年12月18日周三 下午7:21写道:

I just want to revive this discussion.

Recently, i am thinking about how to natively run flink per-job cluster on Kubernetes. The per-job mode on Kubernetes is very different from on Yarn. And we will have the same deployment requirements to the client and entry point.

1. Flink client not always need a local jar to start a Flink per-job cluster. We could support multiple schemas. For example, file:///path/of/my.jar means a jar located at client side, hdfs://myhdfs/user/myname/flink/my.jar means a jar located at remote hdfs, local:///path/in/image/my.jar means a jar located at jobmanager side.

2. Support running user program on master side. This also means the entry point will generate the job graph on master side. We could use the ClasspathJobGraphRetriever or start a local Flink client to achieve this purpose.

cc tison, Aljoscha & Kostas — do you think this is the right direction we need to work?
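(As a concrete illustration of the multiple-schema idea above — this is not something proposed in the thread itself — fetching a user jar on the entry-point side through Flink's `FileSystem` abstraction could look roughly like the sketch below. The class name `UserJarFetcher` is invented for the example, and the local:/// case, where the jar is already baked into the image, would need no copying at all.)

```java
// Illustrative sketch only: resolve a user-jar URI such as file:///... or
// hdfs://... via Flink's FileSystem abstraction and copy it next to the
// entry point before the job graph is generated. "UserJarFetcher" is a
// made-up name for this example, not a class from the FLIP.
import org.apache.flink.core.fs.FSDataInputStream;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class UserJarFetcher {

    /** Copies the jar behind the given URI into localDir and returns the local file. */
    public static File fetchToLocal(String userJarUri, File localDir) throws IOException {
        Path remoteJar = new Path(userJarUri);
        FileSystem fs = remoteJar.getFileSystem(); // implementation picked by URI scheme
        File localJar = new File(localDir, remoteJar.getName());

        try (FSDataInputStream in = fs.open(remoteJar);
             OutputStream out = new FileOutputStream(localJar)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        }
        return localJar;
    }
}
```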
tison <[hidden email]> 于2019年12月12日周四 下午4:48写道:

A quick idea is that we separate the deployment from user program that it has always been done outside the program. On user program executed there is always a ClusterClient that communicates with an existing cluster, remote or local. It will be another thread so just for your information.

Best,
tison.

tison <[hidden email]> 于2019年12月12日周四 下午4:40写道:

Hi Peter,

Another concern I realized recently is that with current Executors abstraction(FLIP-73) I'm afraid that user program is designed to ALWAYS run on the client side. Specifically, we deploy the job in executor when env.execute called. This abstraction possibly prevents Flink runs user program on the cluster side.

For your proposal, in this case we already compiled the program and run on the client side, even we deploy a cluster and retrieve job graph from program metadata, it doesn't make many sense.

cc Aljoscha & Kostas what do you think about this constraint?

Best,
tison.

Peter Huang <[hidden email]> 于2019年12月10日周二 下午12:45写道:

Hi Tison,

Yes, you are right. I think I made the wrong argument in the doc. Basically, the packaging jar problem is only for platform users. In our internal deploy service, we further optimized the deployment latency by letting users to packaging flink-runtime together with the uber jar, so that we don't need to consider multiple flink version support for now. In the session client mode, as Flink libs will be shipped anyway as local resources of yarn, users actually don't need to package those libs into job jar.

Best Regards
Peter Huang

On Mon, Dec 9, 2019 at 8:35 PM tison <[hidden email]> wrote:

> 3. What do you mean about the package? Do users need to compile their jars inlcuding flink-clients, flink-optimizer, flink-table codes?

The answer should be no because they exist in system classpath.

Best,
tison.

Yang Wang <[hidden email]> 于2019年12月10日周二 下午12:18写道:

Hi Peter,

Thanks a lot for starting this discussion. I think this is a very useful feature.

Not only for Yarn, i am focused on flink on Kubernetes integration and come across the same problem. I do not want the job graph generated on client side. Instead, the user jars are built in a user-defined image. When the job manager launched, we just need to generate the job graph based on local user jars.

I have some small suggestion about this.

1. `ProgramJobGraphRetriever` is very similar to `ClasspathJobGraphRetriever`, the differences are the former needs `ProgramMetadata` and the latter needs some arguments. Is it possible to have an unified `JobGraphRetriever` to support both?

2. Is it possible to not use a local user jar to start a per-job cluster? In your case, the user jars has existed on hdfs already and we do need to download the jars to deployer service. Currently, we always need a local user jar to start a flink cluster. It is be great if we could support remote user jars.

> In the implementation, we assume users package flink-clients, flink-optimizer, flink-table together within the job jar. Otherwise, the job graph generation within JobClusterEntryPoint will fail.

3. What do you mean about the package? Do users need to compile their jars inlcuding flink-clients, flink-optimizer, flink-table codes?

Best,
Yang

Peter Huang <[hidden email]> 于2019年12月10日周二 上午2:37写道:

Dear All,

Recently, the Flink community starts to improve the yarn cluster descriptor to make job jar and config files configurable from CLI. It improves the flexibility of Flink deployment Yarn Per Job Mode. For platform users who manage tens of hundreds of streaming pipelines for the whole org or company, we found the job graph generation in client-side is another pinpoint. Thus, we want to propose a configurable feature for FlinkYarnSessionCli. The feature can allow users to choose the job graph generation in Flink ClusterEntryPoint so that the job jar doesn't need to be locally for the job graph generation. The proposal is organized as a FLIP:

https://cwiki.apache.org/confluence/display/FLINK/FLIP-85+Delayed+JobGraph+Generation

Any questions and suggestions are welcomed. Thank you in advance.

Best Regards
Peter Huang
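(To make the proposed "job graph generation in Flink ClusterEntryPoint" a bit more tangible, here is a minimal sketch, assuming the user jar has already been made available on the master, of how the existing client utilities could be used to compile it into a JobGraph there. The class name `MasterSideJobGraphFactory` is invented for the example, and the exact `PackagedProgram`/`PackagedProgramUtils` signatures vary between Flink versions, so treat this as an outline rather than the FLIP's implementation.)

```java
// Minimal sketch (not part of the FLIP): compile the user program into a
// JobGraph on the master side, assuming the user jar is already local.
// Signatures of PackagedProgram/PackagedProgramUtils differ slightly
// between Flink versions; hedged accordingly.
import org.apache.flink.client.program.PackagedProgram;
import org.apache.flink.client.program.PackagedProgramUtils;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.jobgraph.JobGraph;

import java.io.File;

public class MasterSideJobGraphFactory {

    public static JobGraph createJobGraph(
            File localUserJar,
            String entryPointClassName,
            String[] programArgs,
            Configuration flinkConfiguration,
            int defaultParallelism) throws Exception {

        PackagedProgram program = PackagedProgram.newBuilder()
                .setJarFile(localUserJar)
                .setEntryPointClassName(entryPointClassName)
                .setArguments(programArgs)
                .build();

        // Executes the user main() just far enough to capture the pipeline and
        // turns it into a JobGraph, without the jar ever touching the client.
        return PackagedProgramUtils.createJobGraph(
                program, flinkConfiguration, defaultParallelism, /* suppressOutput */ false);
    }
}
```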
Thanks Yang,
That would be very helpful! Jiangjie (Becket) Qin On Mon, Mar 9, 2020 at 3:31 PM Yang Wang <[hidden email]> wrote: > Hi Becket, > > Thanks for your suggestion. We will update the FLIP to add/enrich the > following parts. > * User cli option change, use "-R/--remote" to apply the cluster deploy > mode > * Configuration change, how to specify remote user jars and dependencies > * The whole story about how "application mode" works, upload -> fetch -> > submit job > * The cluster lifecycle, when and how the Flink cluster is destroyed > > > Best, > Yang > > Becket Qin <[hidden email]> 于2020年3月9日周一 下午12:34写道: > >> Thanks for the reply, tison and Yang, >> >> Regarding the public interface, is "-R/--remote" option the only change? >> Will the users also need to provide a remote location to upload and store >> the jars, and a list of jars as dependencies to be uploaded? >> >> It would be important that the public interface section in the FLIP >> includes all the user sensible changes including the CLI / configuration / >> metrics, etc. Can we update the FLIP to include the conclusion we have here >> in the ML? >> >> Thanks, >> >> Jiangjie (Becket) Qin >> >> On Mon, Mar 9, 2020 at 11:59 AM Yang Wang <[hidden email]> wrote: >> >>> Hi Becket, >>> >>> Thanks for jumping out and sharing your concerns. I second tison's >>> answer and just >>> make some additions. >>> >>> >>> > job submission interface >>> >>> This FLIP will introduce an interface for running user `main()` on >>> cluster, named as >>> “ProgramDeployer”. However, it is not a public interface. It will be >>> used in `CliFrontend` >>> when the remote deploy option(-R/--remote-deploy) is specified. So the >>> only changes >>> on user side is about the cli option. >>> >>> >>> > How to fetch the jars? >>> >>> The “local path” and “dfs path“ could be supported to fetch the user >>> jars and dependencies. >>> Just like tison has said, we could ship the user jar and dependencies >>> from client side to >>> HDFS and use the entrypoint to fetch. >>> >>> Also we have some other practical ways to use the new “application mode“. >>> 1. Upload the user jars and dependencies to the DFS(e.g. HDFS, S3, >>> Aliyun OSS) manually >>> or some external deployer system. For K8s, the user jars and >>> dependencies could also be >>> built in the docker image. >>> 2. Specify the remote/local user jar and dependencies in `flink run`. >>> Usually this could also >>> be done by the external deployer system. >>> 3. When the `ClusterEntrypoint` is launched, it will fetch the jars and >>> files automatically. We >>> do not need any specific fetcher implementation. Since we could leverage >>> flink `FileSystem` >>> to do this. >>> >>> >>> >>> >>> >>> Best, >>> Yang >>> >>> tison <[hidden email]> 于2020年3月9日周一 上午11:34写道: >>> >>>> Hi Becket, >>>> >>>> Thanks for your attention on FLIP-85! I answered your question inline. >>>> >>>> 1. What exactly the job submission interface will look like after this >>>> FLIP? The FLIP template has a Public Interface section but was removed from >>>> this FLIP. >>>> >>>> As Yang mentioned in this thread above: >>>> >>>> From user perspective, only a `-R/-- remote-deploy` cli option is >>>> visible. They are not aware of the application mode. >>>> >>>> 2. How will the new ClusterEntrypoint fetch the jars from external >>>> storage? What external storage will be supported out of the box? Will this >>>> "jar fetcher" be pluggable? If so, how does the API look like and how will >>>> users specify the custom "jar fetcher"? 
>>>> >>>> It depends actually. Here are several points: >>>> >>>> i. Currently, shipping user files is handled by Flink, dependencies >>>> fetching can be handled by Flink. >>>> ii. Current, we only support local file system shipfiles. When in >>>> Application Mode, to support meaningful jar fetch we should support user to >>>> configure richer shipfiles schema at first. >>>> iii. Dependencies fetching varies from deployments. That is, on YARN, >>>> its convention is through HDFS; on Kubernetes, its convention is configured >>>> resource server and fetched by initContainer. >>>> >>>> Thus, in the First phase of Application Mode dependencies fetching is >>>> totally handled within Flink. >>>> >>>> 3. It sounds that in this FLIP, the "session cluster" running the >>>> application has the same lifecycle as the user application. How will the >>>> session cluster be teared down after the application finishes? Will the >>>> ClusterEntrypoint do that? Will there be an option of not tearing the >>>> cluster down? >>>> >>>> The precondition we tear down the cluster is *both* >>>> >>>> i. user main reached to its end >>>> ii. all jobs submitted(current, at most one) reached global terminate >>>> state >>>> >>>> For the "how", it is an implementation topic, but conceptually it is >>>> ClusterEntrypoint's responsibility. >>>> >>>> >Will there be an option of not tearing the cluster down? >>>> >>>> I think the answer is "No" because the cluster is designed to be >>>> bounded with an Application. User logic that communicates with the job is >>>> always in its `main`, and for history information we have history server. >>>> >>>> Best, >>>> tison. >>>> >>>> >>>> Becket Qin <[hidden email]> 于2020年3月9日周一 上午8:12写道: >>>> >>>>> Hi Peter and Kostas, >>>>> >>>>> Thanks for creating this FLIP. Moving the JobGraph compilation to the >>>>> cluster makes a lot of sense to me. FLIP-40 had the exactly same idea, but >>>>> is currently dormant and can probably be superseded by this FLIP. After >>>>> reading the FLIP, I still have a few questions. >>>>> >>>>> 1. What exactly the job submission interface will look like after this >>>>> FLIP? The FLIP template has a Public Interface section but was removed from >>>>> this FLIP. >>>>> 2. How will the new ClusterEntrypoint fetch the jars from external >>>>> storage? What external storage will be supported out of the box? Will this >>>>> "jar fetcher" be pluggable? If so, how does the API look like and how will >>>>> users specify the custom "jar fetcher"? >>>>> 3. It sounds that in this FLIP, the "session cluster" running the >>>>> application has the same lifecycle as the user application. How will the >>>>> session cluster be teared down after the application finishes? Will the >>>>> ClusterEntrypoint do that? Will there be an option of not tearing the >>>>> cluster down? >>>>> >>>>> Maybe they have been discussed in the ML earlier, but I think they >>>>> should be part of the FLIP also. >>>>> >>>>> Thanks, >>>>> >>>>> Jiangjie (Becket) Qin >>>>> >>>>> On Thu, Mar 5, 2020 at 10:09 PM Kostas Kloudas <[hidden email]> >>>>> wrote: >>>>> >>>>>> Also from my side +1 to start voting. >>>>>> >>>>>> Cheers, >>>>>> Kostas >>>>>> >>>>>> On Thu, Mar 5, 2020 at 7:45 AM tison <[hidden email]> wrote: >>>>>> > >>>>>> > +1 to star voting. >>>>>> > >>>>>> > Best, >>>>>> > tison. >>>>>> > >>>>>> > >>>>>> > Yang Wang <[hidden email]> 于2020年3月5日周四 下午2:29写道: >>>>>> >> >>>>>> >> Hi Peter, >>>>>> >> Really thanks for your response. 
>>>>>> >> >>>>>> >> Hi all @Kostas Kloudas @Zili Chen @Peter Huang @Rong Rong >>>>>> >> It seems that we have reached an agreement. The “application mode” >>>>>> is regarded as the enhanced “per-job”. It is >>>>>> >> orthogonal with “cluster deploy”. Currently, we bind the “per-job” >>>>>> to `run-user-main-on-client` and “application mode” >>>>>> >> to `run-user-main-on-cluster`. >>>>>> >> >>>>>> >> Do you have other concerns to moving FLIP-85 to voting? >>>>>> >> >>>>>> >> >>>>>> >> Best, >>>>>> >> Yang >>>>>> >> >>>>>> >> Peter Huang <[hidden email]> 于2020年3月5日周四 下午12:48写道: >>>>>> >>> >>>>>> >>> Hi Yang and Kostas, >>>>>> >>> >>>>>> >>> Thanks for the clarification. It makes more sense to me if the >>>>>> long term goal is to replace per job mode to application mode >>>>>> >>> in the future (at the time that multiple execute can be >>>>>> supported). Before that, It will be better to keep the concept of >>>>>> >>> application mode internally. As Yang suggested, User only need >>>>>> to use a `-R/-- remote-deploy` cli option to launch >>>>>> >>> a per job cluster with the main function executed in cluster >>>>>> entry-point. +1 for the execution plan. >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> Best Regards >>>>>> >>> Peter Huang >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> On Tue, Mar 3, 2020 at 7:11 AM Yang Wang <[hidden email]> >>>>>> wrote: >>>>>> >>>> >>>>>> >>>> Hi Peter, >>>>>> >>>> >>>>>> >>>> Having the application mode does not mean we will drop the >>>>>> cluster-deploy >>>>>> >>>> option. I just want to share some thoughts about “Application >>>>>> Mode”. >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> 1. The application mode could cover the per-job sematic. Its >>>>>> lifecyle is bound >>>>>> >>>> to the user `main()`. And all the jobs in the user main will be >>>>>> executed in a same >>>>>> >>>> Flink cluster. In first phase of FLIP-85 implementation, running >>>>>> user main on the >>>>>> >>>> cluster side could be supported in application mode. >>>>>> >>>> >>>>>> >>>> 2. Maybe in the future, we also need to support multiple >>>>>> `execute()` on client side >>>>>> >>>> in a same Flink cluster. Then the per-job mode will evolve to >>>>>> application mode. >>>>>> >>>> >>>>>> >>>> 3. From user perspective, only a `-R/-- remote-deploy` cli >>>>>> option is visible. They >>>>>> >>>> are not aware of the application mode. >>>>>> >>>> >>>>>> >>>> 4. In the first phase, the application mode is working as >>>>>> “per-job”(only one job in >>>>>> >>>> the user main). We just leave more potential for the future. >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> I am not against with calling it “cluster deploy mode” if you >>>>>> all think it is clearer for users. >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> Best, >>>>>> >>>> Yang >>>>>> >>>> >>>>>> >>>> Kostas Kloudas <[hidden email]> 于2020年3月3日周二 下午6:49写道: >>>>>> >>>>> >>>>>> >>>>> Hi Peter, >>>>>> >>>>> >>>>>> >>>>> I understand your point. This is why I was also a bit torn >>>>>> about the >>>>>> >>>>> name and my proposal was a bit aligned with yours (something >>>>>> along the >>>>>> >>>>> lines of "cluster deploy" mode). >>>>>> >>>>> >>>>>> >>>>> But many of the other participants in the discussion suggested >>>>>> the >>>>>> >>>>> "Application Mode". I think that the reasoning is that now the >>>>>> user's >>>>>> >>>>> Application is more self-contained. >>>>>> >>>>> It will be submitted to the cluster and the user can just >>>>>> disconnect. 
>>>>>> >>>>> In addition, as discussed briefly in the doc, in the future >>>>>> there may >>>>>> >>>>> be better support for multi-execute applications which will >>>>>> bring us >>>>>> >>>>> one step closer to the true "Application Mode". But this is how >>>>>> I >>>>>> >>>>> interpreted their arguments, of course they can also express >>>>>> their >>>>>> >>>>> thoughts on the topic :) >>>>>> >>>>> >>>>>> >>>>> Cheers, >>>>>> >>>>> Kostas >>>>>> >>>>> >>>>>> >>>>> On Mon, Mar 2, 2020 at 6:15 PM Peter Huang < >>>>>> [hidden email]> wrote: >>>>>> >>>>> > >>>>>> >>>>> > Hi Kostas, >>>>>> >>>>> > >>>>>> >>>>> > Thanks for updating the wiki. We have aligned with the >>>>>> implementations in the doc. But I feel it is still a little bit confusing >>>>>> of the naming from a user's perspective. It is well known that Flink >>>>>> support per job cluster and session cluster. The concept is in the layer of >>>>>> how a job is managed within Flink. The method introduced util now is a kind >>>>>> of mixing job and session cluster to promising the implementation >>>>>> complexity. We probably don't need to label it as Application Model as the >>>>>> same layer of per job cluster and session cluster. Conceptually, I think it >>>>>> is still a cluster mode implementation for per job cluster. >>>>>> >>>>> > >>>>>> >>>>> > To minimize the confusion of users, I think it would be >>>>>> better just an option of per job cluster for each type of cluster manager. >>>>>> How do you think? >>>>>> >>>>> > >>>>>> >>>>> > >>>>>> >>>>> > Best Regards >>>>>> >>>>> > Peter Huang >>>>>> >>>>> > >>>>>> >>>>> > >>>>>> >>>>> > >>>>>> >>>>> > >>>>>> >>>>> > >>>>>> >>>>> > >>>>>> >>>>> > >>>>>> >>>>> > >>>>>> >>>>> > On Mon, Mar 2, 2020 at 7:22 AM Kostas Kloudas < >>>>>> [hidden email]> wrote: >>>>>> >>>>> >> >>>>>> >>>>> >> Hi Yang, >>>>>> >>>>> >> >>>>>> >>>>> >> The difference between per-job and application mode is that, >>>>>> as you >>>>>> >>>>> >> described, in the per-job mode the main is executed on the >>>>>> client >>>>>> >>>>> >> while in the application mode, the main is executed on the >>>>>> cluster. >>>>>> >>>>> >> I do not think we have to offer "application mode" with >>>>>> running the >>>>>> >>>>> >> main on the client side as this is exactly what the per-job >>>>>> mode does >>>>>> >>>>> >> currently and, as you described also, it would be redundant. >>>>>> >>>>> >> >>>>>> >>>>> >> Sorry if this was not clear in the document. >>>>>> >>>>> >> >>>>>> >>>>> >> Cheers, >>>>>> >>>>> >> Kostas >>>>>> >>>>> >> >>>>>> >>>>> >> On Mon, Mar 2, 2020 at 3:17 PM Yang Wang < >>>>>> [hidden email]> wrote: >>>>>> >>>>> >> > >>>>>> >>>>> >> > Hi Kostas, >>>>>> >>>>> >> > >>>>>> >>>>> >> > Thanks a lot for your conclusion and updating the FLIP-85 >>>>>> WIKI. Currently, i have no more >>>>>> >>>>> >> > questions about motivation, approach, fault tolerance and >>>>>> the first phase implementation. >>>>>> >>>>> >> > >>>>>> >>>>> >> > I think the new title "Flink Application Mode" makes a lot >>>>>> senses to me. Especially for the >>>>>> >>>>> >> > containerized environment, the cluster deploy option will >>>>>> be very useful. >>>>>> >>>>> >> > >>>>>> >>>>> >> > Just one concern, how do we introduce this new application >>>>>> mode to our users? >>>>>> >>>>> >> > Each user program(i.e. `main()`) is an application. >>>>>> Currently, we intend to only support one >>>>>> >>>>> >> > `execute()`. So what's the difference between per-job and >>>>>> application mode? 
>>>>>> >>>>> >> > >>>>>> >>>>> >> > For per-job, user `main()` is always executed on client >>>>>> side. And For application mode, user >>>>>> >>>>> >> > `main()` could be executed on client or master >>>>>> side(configured via cli option). >>>>>> >>>>> >> > Right? We need to have a clear concept. Otherwise, the >>>>>> users will be more and more confusing. >>>>>> >>>>> >> > >>>>>> >>>>> >> > >>>>>> >>>>> >> > Best, >>>>>> >>>>> >> > Yang >>>>>> >>>>> >> > >>>>>> >>>>> >> > Kostas Kloudas <[hidden email]> 于2020年3月2日周一 下午5:58写道: >>>>>> >>>>> >> >> >>>>>> >>>>> >> >> Hi all, >>>>>> >>>>> >> >> >>>>>> >>>>> >> >> I update >>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-85+Flink+Application+Mode >>>>>> >>>>> >> >> based on the discussion we had here: >>>>>> >>>>> >> >> >>>>>> >>>>> >> >> >>>>>> https://docs.google.com/document/d/1ji72s3FD9DYUyGuKnJoO4ApzV-nSsZa0-bceGXW7Ocw/edit# >>>>>> >>>>> >> >> >>>>>> >>>>> >> >> Please let me know what you think and please keep the >>>>>> discussion in the ML :) >>>>>> >>>>> >> >> >>>>>> >>>>> >> >> Thanks for starting the discussion and I hope that soon >>>>>> we will be >>>>>> >>>>> >> >> able to vote on the FLIP. >>>>>> >>>>> >> >> >>>>>> >>>>> >> >> Cheers, >>>>>> >>>>> >> >> Kostas >>>>>> >>>>> >> >> >>>>>> >>>>> >> >> On Thu, Jan 16, 2020 at 3:40 AM Yang Wang < >>>>>> [hidden email]> wrote: >>>>>> >>>>> >> >> > >>>>>> >>>>> >> >> > Hi all, >>>>>> >>>>> >> >> > >>>>>> >>>>> >> >> > Thanks a lot for the feedback from @Kostas Kloudas. >>>>>> Your all concerns are >>>>>> >>>>> >> >> > on point. The FLIP-85 is mainly >>>>>> >>>>> >> >> > focused on supporting cluster mode for per-job. Since >>>>>> it is more urgent and >>>>>> >>>>> >> >> > have much more use >>>>>> >>>>> >> >> > cases both in Yarn and Kubernetes deployment. For >>>>>> session cluster, we could >>>>>> >>>>> >> >> > have more discussion >>>>>> >>>>> >> >> > in a new thread later. >>>>>> >>>>> >> >> > >>>>>> >>>>> >> >> > #1, How to download the user jars and dependencies for >>>>>> per-job in cluster >>>>>> >>>>> >> >> > mode? >>>>>> >>>>> >> >> > For Yarn, we could register the user jars and >>>>>> dependencies as >>>>>> >>>>> >> >> > LocalResource. They will be distributed >>>>>> >>>>> >> >> > by Yarn. And once the JobManager and TaskManager >>>>>> launched, the jars are >>>>>> >>>>> >> >> > already exists. >>>>>> >>>>> >> >> > For Standalone per-job and K8s, we expect that the user >>>>>> jars >>>>>> >>>>> >> >> > and dependencies are built into the image. >>>>>> >>>>> >> >> > Or the InitContainer could be used for downloading. It >>>>>> is natively >>>>>> >>>>> >> >> > distributed and we will not have bottleneck. >>>>>> >>>>> >> >> > >>>>>> >>>>> >> >> > #2, Job graph recovery >>>>>> >>>>> >> >> > We could have an optimization to store job graph on the >>>>>> DFS. However, i >>>>>> >>>>> >> >> > suggest building a new jobgraph >>>>>> >>>>> >> >> > from the configuration is the default option. Since we >>>>>> will not always have >>>>>> >>>>> >> >> > a DFS store when deploying a >>>>>> >>>>> >> >> > Flink per-job cluster. Of course, we assume that using >>>>>> the same >>>>>> >>>>> >> >> > configuration(e.g. job_id, user_jar, main_class, >>>>>> >>>>> >> >> > main_args, parallelism, savepoint_settings, etc.) will >>>>>> get a same job >>>>>> >>>>> >> >> > graph. I think the standalone per-job >>>>>> >>>>> >> >> > already has the similar behavior. >>>>>> >>>>> >> >> > >>>>>> >>>>> >> >> > #3, What happens with jobs that have multiple execute >>>>>> calls? 
>>>>>> >>>>> >> >> > Currently, it is really a problem. Even we use a local >>>>>> client on Flink >>>>>> >>>>> >> >> > master side, it will have different behavior with >>>>>> >>>>> >> >> > client mode. For client mode, if we execute multiple >>>>>> times, then we will >>>>>> >>>>> >> >> > deploy multiple Flink clusters for each execute. >>>>>> >>>>> >> >> > I am not pretty sure whether it is reasonable. However, >>>>>> i still think using >>>>>> >>>>> >> >> > the local client is a good choice. We could >>>>>> >>>>> >> >> > continue the discussion in a new thread. @Zili Chen < >>>>>> [hidden email]> Do >>>>>> >>>>> >> >> > you want to drive this? >>>>>> >>>>> >> >> > >>>>>> >>>>> >> >> > >>>>>> >>>>> >> >> > >>>>>> >>>>> >> >> > Best, >>>>>> >>>>> >> >> > Yang >>>>>> >>>>> >> >> > >>>>>> >>>>> >> >> > Peter Huang <[hidden email]> 于2020年1月16日周四 >>>>>> 上午1:55写道: >>>>>> >>>>> >> >> > >>>>>> >>>>> >> >> > > Hi Kostas, >>>>>> >>>>> >> >> > > >>>>>> >>>>> >> >> > > Thanks for this feedback. I can't agree more about >>>>>> the opinion. The >>>>>> >>>>> >> >> > > cluster mode should be added >>>>>> >>>>> >> >> > > first in per job cluster. >>>>>> >>>>> >> >> > > >>>>>> >>>>> >> >> > > 1) For job cluster implementation >>>>>> >>>>> >> >> > > 1. Job graph recovery from configuration or store as >>>>>> static job graph as >>>>>> >>>>> >> >> > > session cluster. I think the static one will be >>>>>> better for less recovery >>>>>> >>>>> >> >> > > time. >>>>>> >>>>> >> >> > > Let me update the doc for details. >>>>>> >>>>> >> >> > > >>>>>> >>>>> >> >> > > 2. For job execute multiple times, I think @Zili Chen >>>>>> >>>>> >> >> > > <[hidden email]> has proposed the local client >>>>>> solution that can >>>>>> >>>>> >> >> > > the run program actually in the cluster entry point. >>>>>> We can put the >>>>>> >>>>> >> >> > > implementation in the second stage, >>>>>> >>>>> >> >> > > or even a new FLIP for further discussion. >>>>>> >>>>> >> >> > > >>>>>> >>>>> >> >> > > 2) For session cluster implementation >>>>>> >>>>> >> >> > > We can disable the cluster mode for the session >>>>>> cluster in the first >>>>>> >>>>> >> >> > > stage. I agree the jar downloading will be a painful >>>>>> thing. >>>>>> >>>>> >> >> > > We can consider about PoC and performance evaluation >>>>>> first. If the end to >>>>>> >>>>> >> >> > > end experience is good enough, then we can consider >>>>>> >>>>> >> >> > > proceeding with the solution. >>>>>> >>>>> >> >> > > >>>>>> >>>>> >> >> > > Looking forward to more opinions from @Yang Wang < >>>>>> [hidden email]> @Zili >>>>>> >>>>> >> >> > > Chen <[hidden email]> @Dian Fu < >>>>>> [hidden email]>. >>>>>> >>>>> >> >> > > >>>>>> >>>>> >> >> > > >>>>>> >>>>> >> >> > > Best Regards >>>>>> >>>>> >> >> > > Peter Huang >>>>>> >>>>> >> >> > > >>>>>> >>>>> >> >> > > On Wed, Jan 15, 2020 at 7:50 AM Kostas Kloudas < >>>>>> [hidden email]> wrote: >>>>>> >>>>> >> >> > > >>>>>> >>>>> >> >> > >> Hi all, >>>>>> >>>>> >> >> > >> >>>>>> >>>>> >> >> > >> I am writing here as the discussion on the Google >>>>>> Doc seems to be a >>>>>> >>>>> >> >> > >> bit difficult to follow. >>>>>> >>>>> >> >> > >> >>>>>> >>>>> >> >> > >> I think that in order to be able to make progress, >>>>>> it would be helpful >>>>>> >>>>> >> >> > >> to focus on per-job mode for now. 
>>>>>> >>>>> >> >> > >> The reason is that: >>>>>> >>>>> >> >> > >> 1) making the (unique) JobSubmitHandler responsible >>>>>> for creating the >>>>>> >>>>> >> >> > >> jobgraphs, >>>>>> >>>>> >> >> > >> which includes downloading dependencies, is not an >>>>>> optimal solution >>>>>> >>>>> >> >> > >> 2) even if we put the responsibility on the >>>>>> JobMaster, currently each >>>>>> >>>>> >> >> > >> job has its own >>>>>> >>>>> >> >> > >> JobMaster but they all run on the same process, so >>>>>> we have again a >>>>>> >>>>> >> >> > >> single entity. >>>>>> >>>>> >> >> > >> >>>>>> >>>>> >> >> > >> Of course after this is done, and if we feel >>>>>> comfortable with the >>>>>> >>>>> >> >> > >> solution, then we can go to the session mode. >>>>>> >>>>> >> >> > >> >>>>>> >>>>> >> >> > >> A second comment has to do with fault-tolerance in >>>>>> the per-job, >>>>>> >>>>> >> >> > >> cluster-deploy mode. >>>>>> >>>>> >> >> > >> In the document, it is suggested that upon recovery, >>>>>> the JobMaster of >>>>>> >>>>> >> >> > >> each job re-creates the JobGraph. >>>>>> >>>>> >> >> > >> I am just wondering if it is better to create and >>>>>> store the jobGraph >>>>>> >>>>> >> >> > >> upon submission and only fetch it >>>>>> >>>>> >> >> > >> upon recovery so that we have a static jobGraph. >>>>>> >>>>> >> >> > >> >>>>>> >>>>> >> >> > >> Finally, I have a question which is what happens >>>>>> with jobs that have >>>>>> >>>>> >> >> > >> multiple execute calls? >>>>>> >>>>> >> >> > >> The semantics seem to change compared to the current >>>>>> behaviour, right? >>>>>> >>>>> >> >> > >> >>>>>> >>>>> >> >> > >> Cheers, >>>>>> >>>>> >> >> > >> Kostas >>>>>> >>>>> >> >> > >> >>>>>> >>>>> >> >> > >> On Wed, Jan 8, 2020 at 8:05 PM tison < >>>>>> [hidden email]> wrote: >>>>>> >>>>> >> >> > >> > >>>>>> >>>>> >> >> > >> > not always, Yang Wang is also not yet a committer >>>>>> but he can join the >>>>>> >>>>> >> >> > >> > channel. I cannot find the id by clicking “Add new >>>>>> member in channel” so >>>>>> >>>>> >> >> > >> > come to you and ask for try out the link. Possibly >>>>>> I will find other >>>>>> >>>>> >> >> > >> ways >>>>>> >>>>> >> >> > >> > but the original purpose is that the slack channel >>>>>> is a public area we >>>>>> >>>>> >> >> > >> > discuss about developing... >>>>>> >>>>> >> >> > >> > Best, >>>>>> >>>>> >> >> > >> > tison. >>>>>> >>>>> >> >> > >> > >>>>>> >>>>> >> >> > >> > >>>>>> >>>>> >> >> > >> > Peter Huang <[hidden email]> >>>>>> 于2020年1月9日周四 上午2:44写道: >>>>>> >>>>> >> >> > >> > >>>>>> >>>>> >> >> > >> > > Hi Tison, >>>>>> >>>>> >> >> > >> > > >>>>>> >>>>> >> >> > >> > > I am not the committer of Flink yet. I think I >>>>>> can't join it also. >>>>>> >>>>> >> >> > >> > > >>>>>> >>>>> >> >> > >> > > >>>>>> >>>>> >> >> > >> > > Best Regards >>>>>> >>>>> >> >> > >> > > Peter Huang >>>>>> >>>>> >> >> > >> > > >>>>>> >>>>> >> >> > >> > > On Wed, Jan 8, 2020 at 9:39 AM tison < >>>>>> [hidden email]> wrote: >>>>>> >>>>> >> >> > >> > > >>>>>> >>>>> >> >> > >> > > > Hi Peter, >>>>>> >>>>> >> >> > >> > > > >>>>>> >>>>> >> >> > >> > > > Could you try out this link? >>>>>> >>>>> >> >> > >> > > https://the-asf.slack.com/messages/CNA3ADZPH >>>>>> >>>>> >> >> > >> > > > >>>>>> >>>>> >> >> > >> > > > Best, >>>>>> >>>>> >> >> > >> > > > tison. 
>>>>>> >>>>> >> >> > >> > > > >>>>>> >>>>> >> >> > >> > > > >>>>>> >>>>> >> >> > >> > > > Peter Huang <[hidden email]> >>>>>> 于2020年1月9日周四 上午1:22写道: >>>>>> >>>>> >> >> > >> > > > >>>>>> >>>>> >> >> > >> > > > > Hi Tison, >>>>>> >>>>> >> >> > >> > > > > >>>>>> >>>>> >> >> > >> > > > > I can't join the group with shared link. >>>>>> Would you please add me >>>>>> >>>>> >> >> > >> into >>>>>> >>>>> >> >> > >> > > the >>>>>> >>>>> >> >> > >> > > > > group? My slack account is huangzhenqiu0825. >>>>>> >>>>> >> >> > >> > > > > Thank you in advance. >>>>>> >>>>> >> >> > >> > > > > >>>>>> >>>>> >> >> > >> > > > > >>>>>> >>>>> >> >> > >> > > > > Best Regards >>>>>> >>>>> >> >> > >> > > > > Peter Huang >>>>>> >>>>> >> >> > >> > > > > >>>>>> >>>>> >> >> > >> > > > > On Wed, Jan 8, 2020 at 12:02 AM tison < >>>>>> [hidden email]> >>>>>> >>>>> >> >> > >> wrote: >>>>>> >>>>> >> >> > >> > > > > >>>>>> >>>>> >> >> > >> > > > > > Hi Peter, >>>>>> >>>>> >> >> > >> > > > > > >>>>>> >>>>> >> >> > >> > > > > > As described above, this effort should get >>>>>> attention from people >>>>>> >>>>> >> >> > >> > > > > developing >>>>>> >>>>> >> >> > >> > > > > > FLIP-73 a.k.a. Executor abstractions. I >>>>>> recommend you to join >>>>>> >>>>> >> >> > >> the >>>>>> >>>>> >> >> > >> > > > public >>>>>> >>>>> >> >> > >> > > > > > slack channel[1] for Flink Client API >>>>>> Enhancement and you can >>>>>> >>>>> >> >> > >> try to >>>>>> >>>>> >> >> > >> > > > > share >>>>>> >>>>> >> >> > >> > > > > > you detailed thoughts there. It possibly >>>>>> gets more concrete >>>>>> >>>>> >> >> > >> > > attentions. >>>>>> >>>>> >> >> > >> > > > > > >>>>>> >>>>> >> >> > >> > > > > > Best, >>>>>> >>>>> >> >> > >> > > > > > tison. >>>>>> >>>>> >> >> > >> > > > > > >>>>>> >>>>> >> >> > >> > > > > > [1] >>>>>> >>>>> >> >> > >> > > > > > >>>>>> >>>>> >> >> > >> > > > > > >>>>>> >>>>> >> >> > >> > > > > >>>>>> >>>>> >> >> > >> > > > >>>>>> >>>>> >> >> > >> > > >>>>>> >>>>> >> >> > >> >>>>>> https://slack.com/share/IS21SJ75H/Rk8HhUly9FuEHb7oGwBZ33uL/enQtODg2MDYwNjE5MTg3LTA2MjIzNDc1M2ZjZDVlMjdlZjk1M2RkYmJhNjAwMTk2ZDZkODQ4NmY5YmI4OGRhNWJkYTViMTM1NzlmMzc4OWM >>>>>> >>>>> >> >> > >> > > > > > >>>>>> >>>>> >> >> > >> > > > > > >>>>>> >>>>> >> >> > >> > > > > > Peter Huang <[hidden email]> >>>>>> 于2020年1月7日周二 上午5:09写道: >>>>>> >>>>> >> >> > >> > > > > > >>>>>> >>>>> >> >> > >> > > > > > > Dear All, >>>>>> >>>>> >> >> > >> > > > > > > >>>>>> >>>>> >> >> > >> > > > > > > Happy new year! According to existing >>>>>> feedback from the >>>>>> >>>>> >> >> > >> community, >>>>>> >>>>> >> >> > >> > > we >>>>>> >>>>> >> >> > >> > > > > > > revised the doc with the consideration >>>>>> of session cluster >>>>>> >>>>> >> >> > >> support, >>>>>> >>>>> >> >> > >> > > > and >>>>>> >>>>> >> >> > >> > > > > > > concrete interface changes needed and >>>>>> execution plan. Please >>>>>> >>>>> >> >> > >> take >>>>>> >>>>> >> >> > >> > > one >>>>>> >>>>> >> >> > >> > > > > > more >>>>>> >>>>> >> >> > >> > > > > > > round of review at your most convenient >>>>>> time. 
>>>>>> >>>>> >> >> > >> > > > > > > >>>>>> >>>>> >> >> > >> > > > > > > >>>>>> >>>>> >> >> > >> > > > > > > >>>>>> >>>>> >> >> > >> > > > > > >>>>>> >>>>> >> >> > >> > > > > >>>>>> >>>>> >> >> > >> > > > >>>>>> >>>>> >> >> > >> > > >>>>>> >>>>> >> >> > >> >>>>>> https://docs.google.com/document/d/1aAwVjdZByA-0CHbgv16Me-vjaaDMCfhX7TzVVTuifYM/edit# >>>>>> >>>>> >> >> > >> > > > > > > >>>>>> >>>>> >> >> > >> > > > > > > >>>>>> >>>>> >> >> > >> > > > > > > Best Regards >>>>>> >>>>> >> >> > >> > > > > > > Peter Huang >>>>>> >>>>> >> >> > >> > > > > > > >>>>>> >>>>> >> >> > >> > > > > > > >>>>>> >>>>> >> >> > >> > > > > > > >>>>>> >>>>> >> >> > >> > > > > > > >>>>>> >>>>> >> >> > >> > > > > > > >>>>>> >>>>> >> >> > >> > > > > > > On Thu, Jan 2, 2020 at 11:29 AM Peter >>>>>> Huang < >>>>>> >>>>> >> >> > >> > > > > [hidden email]> >>>>>> >>>>> >> >> > >> > > > > > > wrote: >>>>>> >>>>> >> >> > >> > > > > > > >>>>>> >>>>> >> >> > >> > > > > > > > Hi Dian, >>>>>> >>>>> >> >> > >> > > > > > > > Thanks for giving us valuable >>>>>> feedbacks. >>>>>> >>>>> >> >> > >> > > > > > > > >>>>>> >>>>> >> >> > >> > > > > > > > 1) It's better to have a whole design >>>>>> for this feature >>>>>> >>>>> >> >> > >> > > > > > > > For the suggestion of enabling the >>>>>> cluster mode also session >>>>>> >>>>> >> >> > >> > > > > cluster, I >>>>>> >>>>> >> >> > >> > > > > > > > think Flink already supported it. >>>>>> WebSubmissionExtension >>>>>> >>>>> >> >> > >> already >>>>>> >>>>> >> >> > >> > > > > allows >>>>>> >>>>> >> >> > >> > > > > > > > users to start a job with the >>>>>> specified jar by using web UI. >>>>>> >>>>> >> >> > >> > > > > > > > But we need to enable the feature from >>>>>> CLI for both local >>>>>> >>>>> >> >> > >> jar, >>>>>> >>>>> >> >> > >> > > > remote >>>>>> >>>>> >> >> > >> > > > > > > jar. >>>>>> >>>>> >> >> > >> > > > > > > > I will align with Yang Wang first >>>>>> about the details and >>>>>> >>>>> >> >> > >> update >>>>>> >>>>> >> >> > >> > > the >>>>>> >>>>> >> >> > >> > > > > > design >>>>>> >>>>> >> >> > >> > > > > > > > doc. >>>>>> >>>>> >> >> > >> > > > > > > > >>>>>> >>>>> >> >> > >> > > > > > > > 2) It's better to consider the >>>>>> convenience for users, such >>>>>> >>>>> >> >> > >> as >>>>>> >>>>> >> >> > >> > > > > debugging >>>>>> >>>>> >> >> > >> > > > > > > > >>>>>> >>>>> >> >> > >> > > > > > > > I am wondering whether we can store >>>>>> the exception in >>>>>> >>>>> >> >> > >> jobgragh >>>>>> >>>>> >> >> > >> > > > > > > > generation in application master. As >>>>>> no streaming graph can >>>>>> >>>>> >> >> > >> be >>>>>> >>>>> >> >> > >> > > > > > scheduled >>>>>> >>>>> >> >> > >> > > > > > > in >>>>>> >>>>> >> >> > >> > > > > > > > this case, there will be no more TM >>>>>> will be requested from >>>>>> >>>>> >> >> > >> > > FlinkRM. >>>>>> >>>>> >> >> > >> > > > > > > > If the AM is still running, users can >>>>>> still query it from >>>>>> >>>>> >> >> > >> CLI. As >>>>>> >>>>> >> >> > >> > > > it >>>>>> >>>>> >> >> > >> > > > > > > > requires more change, we can get some >>>>>> feedback from < >>>>>> >>>>> >> >> > >> > > > > > [hidden email] >>>>>> >>>>> >> >> > >> > > > > > > > >>>>>> >>>>> >> >> > >> > > > > > > > and @[hidden email] < >>>>>> [hidden email]>. 
3) It's better to consider the impact to the stability of the cluster
I agree with Yang Wang's opinion.

Best Regards
Peter Huang

On Sun, Dec 29, 2019 at 9:44 PM Dian Fu <[hidden email]> wrote:

Hi all,

Sorry to jump into this discussion. Thanks everyone for the discussion. I'm very interested in this topic although I'm not an expert in this part, so I'm glad to share my thoughts as follows:

1) It's better to have a whole design for this feature
As we know, there are two deployment modes: per-job mode and session mode. I'm wondering which mode really needs this feature. As the design doc mentioned, per-job mode is more used for streaming jobs and session mode is usually used for batch jobs (of course, the job types and the deployment modes are orthogonal). Usually a streaming job only needs to be submitted once and it will run for days or weeks, while batch jobs will be submitted more frequently compared with streaming jobs. This means that maybe session mode also needs this feature. However, if we support this feature in session mode, the application master will become the new centralized service (which should be solved). So in this case, it's better to have a complete design for both per-job mode and session mode. Furthermore, even if we can do it phase by phase, we need to have a whole picture of how it works in both per-job mode and session mode.

2) It's better to consider the convenience for users, such as debugging
After we finish this feature, the job graph will be compiled in the application master, which means that users cannot easily get the exception message synchronously in the job client if there are problems during the job graph compiling (especially for platform users), such as the resource path being incorrect, the user program itself having some problems, etc. What I'm thinking is that maybe we should throw the exceptions as early as possible (during the job submission stage).

3) It's better to consider the impact to the stability of the cluster
If we perform the compiling in the application master, we should consider the impact of the compiling errors. Although YARN could resume the application master in case of failures, in some cases the compiling failure may be a waste of cluster resources and may impact the stability of the cluster and the other jobs in the cluster, such as when the resource path is incorrect or the user program itself has some problems (in this case, job failover cannot solve this kind of problem), etc. In the current implementation, the compiling errors are handled on the client side and there is no impact on the cluster at all.

Regarding 1), it's clearly pointed out in the design doc that only per-job mode will be supported. However, I think it's better to also consider the session mode in the design doc.
Regarding 2) and 3), I have not seen related sections in the design doc. It will be good if we can cover them in the design doc.

Feel free to correct me if there is anything I misunderstood.

Regards,
Dian

在 2019年12月27日,上午3:13,Peter Huang <[hidden email]> 写道:

Hi Yang,

I can't agree more. The effort definitely needs to align with the final goal of FLIP-73. I am thinking about whether we can achieve the goal with two phases.

1) Phase I
As the CliFrontend will not be deprecated soon, we can still use the deployMode flag there, pass the program info through the Flink configuration, and use the ClassPathJobGraphRetriever to generate the job graph in the ClusterEntrypoints of YARN and Kubernetes.

2) Phase II
In AbstractJobClusterExecutor, the job graph is generated in the execute function. We can still use the deployMode in it. With deployMode = cluster, the execute function only starts the cluster. When {Yarn/Kubernetes}PerJobClusterEntrypoint starts, it will start the dispatcher first, then we can use a ClusterEnvironment similar to ContextEnvironment to submit the job with jobName to the local dispatcher. For the details, we need more investigation. Let's wait for @Aljoscha Krettek <[hidden email]> @Till Rohrmann <[hidden email]>'s feedback after the holiday season.

Thank you in advance. Merry Christmas and Happy New Year!!!

Best Regards
Peter Huang

On Wed, Dec 25, 2019 at 1:08 AM Yang Wang <[hidden email]> wrote:

Hi Peter,

I think we need to reconsider tison's suggestion seriously. After FLIP-73, the deployJobCluster has been moved into `JobClusterExecutor#execute`. It should not be perceived by `CliFrontend`. That means the user program will *ALWAYS* be executed on the client side. This is the by-design behavior. So, we could not just add `if(client mode) .. else if(cluster mode) ...` code in `CliFrontend` to bypass the executor. We need to find a clean way to decouple executing the user program and deploying the per-job cluster. Based on this, we could support executing the user program on the client or master side.

Maybe Aljoscha and Jeff could give some good suggestions.

Best,
Yang

Peter Huang <[hidden email]> 于2019年12月25日周三 上午4:03写道:

Hi Jingjing,

The improvement proposed is a deployment option for the CLI. For SQL-based Flink applications, it is more convenient to use the existing model in SqlClient, in which the job graph is generated within SqlClient. After adding the delayed job graph generation, I think no change is needed on your side.

Best Regards
Peter Huang

On Wed, Dec 18, 2019 at 6:01 AM jingjing bai <[hidden email]> wrote:

hi peter:
We had extended SqlClient to support SQL job submission from the web, based on Flink 1.9, and we support submitting to YARN in per-job mode too. In this case, the job graph is generated on the client side. I think this discussion is mainly about improving the API programme, but in my case there is no jar to upload, only a SQL string. Do you have more suggestions to improve the SQL mode, or is it only a switch for the API programme?

best
bai jj

Yang Wang <[hidden email]> 于2019年12月18日周三 下午7:21写道:

I just want to revive this discussion.

Recently, I am thinking about how to natively run a Flink per-job cluster on Kubernetes. The per-job mode on Kubernetes is very different from on YARN, and we will have the same deployment requirements for the client and entry point.

1. The Flink client does not always need a local jar to start a Flink per-job cluster. We could support multiple schemes. For example, file:///path/of/my.jar means a jar located at the client side, hdfs://myhdfs/user/myname/flink/my.jar means a jar located on a remote HDFS, and local:///path/in/image/my.jar means a jar located at the jobmanager side.

2. Support running the user program on the master side. This also means the entry point will generate the job graph on the master side. We could use the ClasspathJobGraphRetriever or start a local Flink client to achieve this purpose.

cc tison, Aljoscha & Kostas: do you think this is the right direction we need to work on?

tison <[hidden email]> 于2019年12月12日周四 下午4:48写道:

A quick idea is that we separate the deployment from the user program so that it is always done outside the program. When the user program is executed there is always a ClusterClient that communicates with an existing cluster, remote or local. It will be another thread, so just for your information.

Best,
tison.

tison <[hidden email]> 于2019年12月12日周四 下午4:40写道:

Hi Peter,

Another concern I realized recently is that with the current Executors abstraction (FLIP-73) I'm afraid that the user program is designed to ALWAYS run on the client side. Specifically, we deploy the job in the executor when env.execute is called. This abstraction possibly prevents Flink from running the user program on the cluster side.

For your proposal, in this case we have already compiled the program and run it on the client side; even if we deploy a cluster and retrieve the job graph from program metadata, it doesn't make much sense.

cc Aljoscha & Kostas: what do you think about this constraint?

Best,
tison.

Peter Huang <[hidden email]> 于2019年12月10日周二 下午12:45写道:

Hi Tison,

Yes, you are right. I think I made the wrong argument in the doc. Basically, the packaging jar problem is only for platform users. In our internal deploy service, we further optimized the deployment latency by letting users package flink-runtime together with the uber jar, so that we don't need to consider multiple Flink version support for now. In the session client mode, as the Flink libs will be shipped anyway as local resources of YARN, users actually don't need to package those libs into the job jar.

Best Regards
Peter Huang

On Mon, Dec 9, 2019 at 8:35 PM tison <[hidden email]> wrote:

> 3. What do you mean about the package? Do users need to compile their jars including flink-clients, flink-optimizer, flink-table codes?

The answer should be no because they exist in the system classpath.

Best,
tison.

Yang Wang <[hidden email]> 于2019年12月10日周二 下午12:18写道:

Hi Peter,

Thanks a lot for starting this discussion. I think this is a very useful feature.

Not only for YARN: I am focused on the Flink on Kubernetes integration and came across the same problem. I do not want the job graph generated on the client side. Instead, the user jars are built into a user-defined image. When the job manager is launched, we just need to generate the job graph based on the local user jars.

I have some small suggestions about this.

1. `ProgramJobGraphRetriever` is very similar to `ClasspathJobGraphRetriever`; the differences are that the former needs `ProgramMetadata` and the latter needs some arguments. Is it possible to have a unified `JobGraphRetriever` to support both?

2. Is it possible to not use a local user jar to start a per-job cluster? In your case, the user jars already exist on HDFS and we do need to download the jars to the deployer service. Currently, we always need a local user jar to start a Flink cluster. It would be great if we could support remote user jars.

> In the implementation, we assume users package flink-clients, flink-optimizer, flink-table together within the job jar. Otherwise, the job graph generation within JobClusterEntryPoint will fail.

3. What do you mean about the package? Do users need to compile their jars including flink-clients, flink-optimizer, flink-table codes?

Best,
Yang

Peter Huang <[hidden email]> 于2019年12月10日周二 上午2:37写道:

Dear All,

Recently, the Flink community has started to improve the YARN cluster descriptor to make the job jar and config files configurable from the CLI. It improves the flexibility of Flink deployment in YARN per-job mode. For platform users who manage tens or hundreds of streaming pipelines for the whole org or company, we found that the job graph generation on the client side is another pain point. Thus, we want to propose a configurable feature for FlinkYarnSessionCli. The feature can allow users to choose job graph generation in the Flink ClusterEntryPoint so that the job jar doesn't need to be available locally for the job graph generation. The proposal is organized as a FLIP:

https://cwiki.apache.org/confluence/display/FLINK/FLIP-85+Delayed+JobGraph+Generation

Any questions and suggestions are welcomed. Thank you in advance.

Best Regards
Peter Huang
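To make the idea of generating the job graph inside the ClusterEntryPoint concrete, a minimal sketch is shown below. It assumes Flink's `PackagedProgram` / `PackagedProgramUtils` client APIs roughly as they exist around Flink 1.10; the exact builder and method signatures vary between versions, and the jar path, entry-point class, and program arguments are hypothetical placeholders rather than anything defined by FLIP-85.

```java
import java.io.File;

import org.apache.flink.client.program.PackagedProgram;
import org.apache.flink.client.program.PackagedProgramUtils;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.CoreOptions;
import org.apache.flink.runtime.jobgraph.JobGraph;

// Sketch only: compile the user program into a JobGraph on the cluster side,
// assuming the user jar has already been materialized next to the JobManager
// (e.g. shipped as a YARN LocalResource or baked into the container image).
public final class ClusterSideJobGraphSketch {

    public static JobGraph createJobGraph(Configuration flinkConfig) throws Exception {
        PackagedProgram program = PackagedProgram.newBuilder()
                .setJarFile(new File("/opt/flink/usrlib/job.jar"))      // hypothetical local path
                .setEntryPointClassName("com.example.MyStreamingJob")   // hypothetical main class
                .setArguments("--input", "kafka://topic")               // hypothetical arguments
                .build();

        int parallelism = flinkConfig.getInteger(CoreOptions.DEFAULT_PARALLELISM);

        // Runs the user's main() far enough to capture the pipeline and turn it into a JobGraph.
        return PackagedProgramUtils.createJobGraph(program, flinkConfig, parallelism, false);
    }
}
```

In per-job or application mode this step would run once inside the entrypoint, after the user jar has been localized, instead of inside `CliFrontend` on the client.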
Hi all,
And thanks for the discussion topics. For the cluster lifecycle, it is the Entrypoint that will tear down the cluster when the application finishes. Probably we should emphasise it a bit more in the FLIP. For the -R flag, this was in the PoC that I published just as a quick implementation, so that I can move fast to the entrypoint part. Personally, I would not even be against having a separate command in the CLI for this, sth like run-on-cluster or something along those lines. What do you think? For fetching jars, in the FLIP we say that as a first implementation we can have Local and DFS. I was wondering if in the case of YARN, both could be somehow implemented using LocalResources, and let Yarn do the actual fetch. But I have not investigated it further. Do you have any opinion on this? Cheers, Kostas On Mon, Mar 9, 2020 at 10:47 AM Becket Qin <[hidden email]> wrote: > > Thanks Yang, > > That would be very helpful! > > Jiangjie (Becket) Qin > > On Mon, Mar 9, 2020 at 3:31 PM Yang Wang <[hidden email]> wrote: >> >> Hi Becket, >> >> Thanks for your suggestion. We will update the FLIP to add/enrich the following parts. >> * User cli option change, use "-R/--remote" to apply the cluster deploy mode >> * Configuration change, how to specify remote user jars and dependencies >> * The whole story about how "application mode" works, upload -> fetch -> submit job >> * The cluster lifecycle, when and how the Flink cluster is destroyed >> >> >> Best, >> Yang >> >> Becket Qin <[hidden email]> 于2020年3月9日周一 下午12:34写道: >>> >>> Thanks for the reply, tison and Yang, >>> >>> Regarding the public interface, is "-R/--remote" option the only change? Will the users also need to provide a remote location to upload and store the jars, and a list of jars as dependencies to be uploaded? >>> >>> It would be important that the public interface section in the FLIP includes all the user sensible changes including the CLI / configuration / metrics, etc. Can we update the FLIP to include the conclusion we have here in the ML? >>> >>> Thanks, >>> >>> Jiangjie (Becket) Qin >>> >>> On Mon, Mar 9, 2020 at 11:59 AM Yang Wang <[hidden email]> wrote: >>>> >>>> Hi Becket, >>>> >>>> Thanks for jumping out and sharing your concerns. I second tison's answer and just >>>> make some additions. >>>> >>>> >>>> > job submission interface >>>> >>>> This FLIP will introduce an interface for running user `main()` on cluster, named as >>>> “ProgramDeployer”. However, it is not a public interface. It will be used in `CliFrontend` >>>> when the remote deploy option(-R/--remote-deploy) is specified. So the only changes >>>> on user side is about the cli option. >>>> >>>> >>>> > How to fetch the jars? >>>> >>>> The “local path” and “dfs path“ could be supported to fetch the user jars and dependencies. >>>> Just like tison has said, we could ship the user jar and dependencies from client side to >>>> HDFS and use the entrypoint to fetch. >>>> >>>> Also we have some other practical ways to use the new “application mode“. >>>> 1. Upload the user jars and dependencies to the DFS(e.g. HDFS, S3, Aliyun OSS) manually >>>> or some external deployer system. For K8s, the user jars and dependencies could also be >>>> built in the docker image. >>>> 2. Specify the remote/local user jar and dependencies in `flink run`. Usually this could also >>>> be done by the external deployer system. >>>> 3. When the `ClusterEntrypoint` is launched, it will fetch the jars and files automatically. We >>>> do not need any specific fetcher implementation. 
Since we could leverage flink `FileSystem` >>>> to do this. >>>> >>>> >>>> >>>> >>>> Best, >>>> Yang >>>> >>>> tison <[hidden email]> 于2020年3月9日周一 上午11:34写道: >>>>> >>>>> Hi Becket, >>>>> >>>>> Thanks for your attention on FLIP-85! I answered your question inline. >>>>> >>>>> 1. What exactly the job submission interface will look like after this FLIP? The FLIP template has a Public Interface section but was removed from this FLIP. >>>>> >>>>> As Yang mentioned in this thread above: >>>>> >>>>> From user perspective, only a `-R/-- remote-deploy` cli option is visible. They are not aware of the application mode. >>>>> >>>>> 2. How will the new ClusterEntrypoint fetch the jars from external storage? What external storage will be supported out of the box? Will this "jar fetcher" be pluggable? If so, how does the API look like and how will users specify the custom "jar fetcher"? >>>>> >>>>> It depends actually. Here are several points: >>>>> >>>>> i. Currently, shipping user files is handled by Flink, dependencies fetching can be handled by Flink. >>>>> ii. Current, we only support local file system shipfiles. When in Application Mode, to support meaningful jar fetch we should support user to configure richer shipfiles schema at first. >>>>> iii. Dependencies fetching varies from deployments. That is, on YARN, its convention is through HDFS; on Kubernetes, its convention is configured resource server and fetched by initContainer. >>>>> >>>>> Thus, in the First phase of Application Mode dependencies fetching is totally handled within Flink. >>>>> >>>>> 3. It sounds that in this FLIP, the "session cluster" running the application has the same lifecycle as the user application. How will the session cluster be teared down after the application finishes? Will the ClusterEntrypoint do that? Will there be an option of not tearing the cluster down? >>>>> >>>>> The precondition we tear down the cluster is *both* >>>>> >>>>> i. user main reached to its end >>>>> ii. all jobs submitted(current, at most one) reached global terminate state >>>>> >>>>> For the "how", it is an implementation topic, but conceptually it is ClusterEntrypoint's responsibility. >>>>> >>>>> >Will there be an option of not tearing the cluster down? >>>>> >>>>> I think the answer is "No" because the cluster is designed to be bounded with an Application. User logic that communicates with the job is always in its `main`, and for history information we have history server. >>>>> >>>>> Best, >>>>> tison. >>>>> >>>>> >>>>> Becket Qin <[hidden email]> 于2020年3月9日周一 上午8:12写道: >>>>>> >>>>>> Hi Peter and Kostas, >>>>>> >>>>>> Thanks for creating this FLIP. Moving the JobGraph compilation to the cluster makes a lot of sense to me. FLIP-40 had the exactly same idea, but is currently dormant and can probably be superseded by this FLIP. After reading the FLIP, I still have a few questions. >>>>>> >>>>>> 1. What exactly the job submission interface will look like after this FLIP? The FLIP template has a Public Interface section but was removed from this FLIP. >>>>>> 2. How will the new ClusterEntrypoint fetch the jars from external storage? What external storage will be supported out of the box? Will this "jar fetcher" be pluggable? If so, how does the API look like and how will users specify the custom "jar fetcher"? >>>>>> 3. It sounds that in this FLIP, the "session cluster" running the application has the same lifecycle as the user application. How will the session cluster be teared down after the application finishes? 
Will the ClusterEntrypoint do that? Will there be an option of not tearing the cluster down? >>>>>> >>>>>> Maybe they have been discussed in the ML earlier, but I think they should be part of the FLIP also. >>>>>> >>>>>> Thanks, >>>>>> >>>>>> Jiangjie (Becket) Qin >>>>>> >>>>>> On Thu, Mar 5, 2020 at 10:09 PM Kostas Kloudas <[hidden email]> wrote: >>>>>>> >>>>>>> Also from my side +1 to start voting. >>>>>>> >>>>>>> Cheers, >>>>>>> Kostas >>>>>>> >>>>>>> On Thu, Mar 5, 2020 at 7:45 AM tison <[hidden email]> wrote: >>>>>>> > >>>>>>> > +1 to star voting. >>>>>>> > >>>>>>> > Best, >>>>>>> > tison. >>>>>>> > >>>>>>> > >>>>>>> > Yang Wang <[hidden email]> 于2020年3月5日周四 下午2:29写道: >>>>>>> >> >>>>>>> >> Hi Peter, >>>>>>> >> Really thanks for your response. >>>>>>> >> >>>>>>> >> Hi all @Kostas Kloudas @Zili Chen @Peter Huang @Rong Rong >>>>>>> >> It seems that we have reached an agreement. The “application mode” is regarded as the enhanced “per-job”. It is >>>>>>> >> orthogonal with “cluster deploy”. Currently, we bind the “per-job” to `run-user-main-on-client` and “application mode” >>>>>>> >> to `run-user-main-on-cluster`. >>>>>>> >> >>>>>>> >> Do you have other concerns to moving FLIP-85 to voting? >>>>>>> >> >>>>>>> >> >>>>>>> >> Best, >>>>>>> >> Yang >>>>>>> >> >>>>>>> >> Peter Huang <[hidden email]> 于2020年3月5日周四 下午12:48写道: >>>>>>> >>> >>>>>>> >>> Hi Yang and Kostas, >>>>>>> >>> >>>>>>> >>> Thanks for the clarification. It makes more sense to me if the long term goal is to replace per job mode to application mode >>>>>>> >>> in the future (at the time that multiple execute can be supported). Before that, It will be better to keep the concept of >>>>>>> >>> application mode internally. As Yang suggested, User only need to use a `-R/-- remote-deploy` cli option to launch >>>>>>> >>> a per job cluster with the main function executed in cluster entry-point. +1 for the execution plan. >>>>>>> >>> >>>>>>> >>> >>>>>>> >>> >>>>>>> >>> Best Regards >>>>>>> >>> Peter Huang >>>>>>> >>> >>>>>>> >>> >>>>>>> >>> >>>>>>> >>> >>>>>>> >>> On Tue, Mar 3, 2020 at 7:11 AM Yang Wang <[hidden email]> wrote: >>>>>>> >>>> >>>>>>> >>>> Hi Peter, >>>>>>> >>>> >>>>>>> >>>> Having the application mode does not mean we will drop the cluster-deploy >>>>>>> >>>> option. I just want to share some thoughts about “Application Mode”. >>>>>>> >>>> >>>>>>> >>>> >>>>>>> >>>> 1. The application mode could cover the per-job sematic. Its lifecyle is bound >>>>>>> >>>> to the user `main()`. And all the jobs in the user main will be executed in a same >>>>>>> >>>> Flink cluster. In first phase of FLIP-85 implementation, running user main on the >>>>>>> >>>> cluster side could be supported in application mode. >>>>>>> >>>> >>>>>>> >>>> 2. Maybe in the future, we also need to support multiple `execute()` on client side >>>>>>> >>>> in a same Flink cluster. Then the per-job mode will evolve to application mode. >>>>>>> >>>> >>>>>>> >>>> 3. From user perspective, only a `-R/-- remote-deploy` cli option is visible. They >>>>>>> >>>> are not aware of the application mode. >>>>>>> >>>> >>>>>>> >>>> 4. In the first phase, the application mode is working as “per-job”(only one job in >>>>>>> >>>> the user main). We just leave more potential for the future. >>>>>>> >>>> >>>>>>> >>>> >>>>>>> >>>> I am not against with calling it “cluster deploy mode” if you all think it is clearer for users. 
>>>>>>> >>>> >>>>>>> >>>> >>>>>>> >>>> >>>>>>> >>>> Best, >>>>>>> >>>> Yang >>>>>>> >>>> >>>>>>> >>>> Kostas Kloudas <[hidden email]> 于2020年3月3日周二 下午6:49写道: >>>>>>> >>>>> >>>>>>> >>>>> Hi Peter, >>>>>>> >>>>> >>>>>>> >>>>> I understand your point. This is why I was also a bit torn about the >>>>>>> >>>>> name and my proposal was a bit aligned with yours (something along the >>>>>>> >>>>> lines of "cluster deploy" mode). >>>>>>> >>>>> >>>>>>> >>>>> But many of the other participants in the discussion suggested the >>>>>>> >>>>> "Application Mode". I think that the reasoning is that now the user's >>>>>>> >>>>> Application is more self-contained. >>>>>>> >>>>> It will be submitted to the cluster and the user can just disconnect. >>>>>>> >>>>> In addition, as discussed briefly in the doc, in the future there may >>>>>>> >>>>> be better support for multi-execute applications which will bring us >>>>>>> >>>>> one step closer to the true "Application Mode". But this is how I >>>>>>> >>>>> interpreted their arguments, of course they can also express their >>>>>>> >>>>> thoughts on the topic :) >>>>>>> >>>>> >>>>>>> >>>>> Cheers, >>>>>>> >>>>> Kostas >>>>>>> >>>>> >>>>>>> >>>>> On Mon, Mar 2, 2020 at 6:15 PM Peter Huang <[hidden email]> wrote: >>>>>>> >>>>> > >>>>>>> >>>>> > Hi Kostas, >>>>>>> >>>>> > >>>>>>> >>>>> > Thanks for updating the wiki. We have aligned with the implementations in the doc. But I feel it is still a little bit confusing of the naming from a user's perspective. It is well known that Flink support per job cluster and session cluster. The concept is in the layer of how a job is managed within Flink. The method introduced util now is a kind of mixing job and session cluster to promising the implementation complexity. We probably don't need to label it as Application Model as the same layer of per job cluster and session cluster. Conceptually, I think it is still a cluster mode implementation for per job cluster. >>>>>>> >>>>> > >>>>>>> >>>>> > To minimize the confusion of users, I think it would be better just an option of per job cluster for each type of cluster manager. How do you think? >>>>>>> >>>>> > >>>>>>> >>>>> > >>>>>>> >>>>> > Best Regards >>>>>>> >>>>> > Peter Huang >>>>>>> >>>>> > >>>>>>> >>>>> > >>>>>>> >>>>> > >>>>>>> >>>>> > >>>>>>> >>>>> > >>>>>>> >>>>> > >>>>>>> >>>>> > >>>>>>> >>>>> > >>>>>>> >>>>> > On Mon, Mar 2, 2020 at 7:22 AM Kostas Kloudas <[hidden email]> wrote: >>>>>>> >>>>> >> >>>>>>> >>>>> >> Hi Yang, >>>>>>> >>>>> >> >>>>>>> >>>>> >> The difference between per-job and application mode is that, as you >>>>>>> >>>>> >> described, in the per-job mode the main is executed on the client >>>>>>> >>>>> >> while in the application mode, the main is executed on the cluster. >>>>>>> >>>>> >> I do not think we have to offer "application mode" with running the >>>>>>> >>>>> >> main on the client side as this is exactly what the per-job mode does >>>>>>> >>>>> >> currently and, as you described also, it would be redundant. >>>>>>> >>>>> >> >>>>>>> >>>>> >> Sorry if this was not clear in the document. >>>>>>> >>>>> >> >>>>>>> >>>>> >> Cheers, >>>>>>> >>>>> >> Kostas >>>>>>> >>>>> >> >>>>>>> >>>>> >> On Mon, Mar 2, 2020 at 3:17 PM Yang Wang <[hidden email]> wrote: >>>>>>> >>>>> >> > >>>>>>> >>>>> >> > Hi Kostas, >>>>>>> >>>>> >> > >>>>>>> >>>>> >> > Thanks a lot for your conclusion and updating the FLIP-85 WIKI. Currently, i have no more >>>>>>> >>>>> >> > questions about motivation, approach, fault tolerance and the first phase implementation. 
>>>>>>> >>>>> >> > >>>>>>> >>>>> >> > I think the new title "Flink Application Mode" makes a lot senses to me. Especially for the >>>>>>> >>>>> >> > containerized environment, the cluster deploy option will be very useful. >>>>>>> >>>>> >> > >>>>>>> >>>>> >> > Just one concern, how do we introduce this new application mode to our users? >>>>>>> >>>>> >> > Each user program(i.e. `main()`) is an application. Currently, we intend to only support one >>>>>>> >>>>> >> > `execute()`. So what's the difference between per-job and application mode? >>>>>>> >>>>> >> > >>>>>>> >>>>> >> > For per-job, user `main()` is always executed on client side. And For application mode, user >>>>>>> >>>>> >> > `main()` could be executed on client or master side(configured via cli option). >>>>>>> >>>>> >> > Right? We need to have a clear concept. Otherwise, the users will be more and more confusing. >>>>>>> >>>>> >> > >>>>>>> >>>>> >> > >>>>>>> >>>>> >> > Best, >>>>>>> >>>>> >> > Yang >>>>>>> >>>>> >> > >>>>>>> >>>>> >> > Kostas Kloudas <[hidden email]> 于2020年3月2日周一 下午5:58写道: >>>>>>> >>>>> >> >> >>>>>>> >>>>> >> >> Hi all, >>>>>>> >>>>> >> >> >>>>>>> >>>>> >> >> I update https://cwiki.apache.org/confluence/display/FLINK/FLIP-85+Flink+Application+Mode >>>>>>> >>>>> >> >> based on the discussion we had here: >>>>>>> >>>>> >> >> >>>>>>> >>>>> >> >> https://docs.google.com/document/d/1ji72s3FD9DYUyGuKnJoO4ApzV-nSsZa0-bceGXW7Ocw/edit# >>>>>>> >>>>> >> >> >>>>>>> >>>>> >> >> Please let me know what you think and please keep the discussion in the ML :) >>>>>>> >>>>> >> >> >>>>>>> >>>>> >> >> Thanks for starting the discussion and I hope that soon we will be >>>>>>> >>>>> >> >> able to vote on the FLIP. >>>>>>> >>>>> >> >> >>>>>>> >>>>> >> >> Cheers, >>>>>>> >>>>> >> >> Kostas >>>>>>> >>>>> >> >> >>>>>>> >>>>> >> >> On Thu, Jan 16, 2020 at 3:40 AM Yang Wang <[hidden email]> wrote: >>>>>>> >>>>> >> >> > >>>>>>> >>>>> >> >> > Hi all, >>>>>>> >>>>> >> >> > >>>>>>> >>>>> >> >> > Thanks a lot for the feedback from @Kostas Kloudas. Your all concerns are >>>>>>> >>>>> >> >> > on point. The FLIP-85 is mainly >>>>>>> >>>>> >> >> > focused on supporting cluster mode for per-job. Since it is more urgent and >>>>>>> >>>>> >> >> > have much more use >>>>>>> >>>>> >> >> > cases both in Yarn and Kubernetes deployment. For session cluster, we could >>>>>>> >>>>> >> >> > have more discussion >>>>>>> >>>>> >> >> > in a new thread later. >>>>>>> >>>>> >> >> > >>>>>>> >>>>> >> >> > #1, How to download the user jars and dependencies for per-job in cluster >>>>>>> >>>>> >> >> > mode? >>>>>>> >>>>> >> >> > For Yarn, we could register the user jars and dependencies as >>>>>>> >>>>> >> >> > LocalResource. They will be distributed >>>>>>> >>>>> >> >> > by Yarn. And once the JobManager and TaskManager launched, the jars are >>>>>>> >>>>> >> >> > already exists. >>>>>>> >>>>> >> >> > For Standalone per-job and K8s, we expect that the user jars >>>>>>> >>>>> >> >> > and dependencies are built into the image. >>>>>>> >>>>> >> >> > Or the InitContainer could be used for downloading. It is natively >>>>>>> >>>>> >> >> > distributed and we will not have bottleneck. >>>>>>> >>>>> >> >> > >>>>>>> >>>>> >> >> > #2, Job graph recovery >>>>>>> >>>>> >> >> > We could have an optimization to store job graph on the DFS. However, i >>>>>>> >>>>> >> >> > suggest building a new jobgraph >>>>>>> >>>>> >> >> > from the configuration is the default option. 
Since we will not always have >>>>>>> >>>>> >> >> > a DFS store when deploying a >>>>>>> >>>>> >> >> > Flink per-job cluster. Of course, we assume that using the same >>>>>>> >>>>> >> >> > configuration(e.g. job_id, user_jar, main_class, >>>>>>> >>>>> >> >> > main_args, parallelism, savepoint_settings, etc.) will get a same job >>>>>>> >>>>> >> >> > graph. I think the standalone per-job >>>>>>> >>>>> >> >> > already has the similar behavior. >>>>>>> >>>>> >> >> > >>>>>>> >>>>> >> >> > #3, What happens with jobs that have multiple execute calls? >>>>>>> >>>>> >> >> > Currently, it is really a problem. Even we use a local client on Flink >>>>>>> >>>>> >> >> > master side, it will have different behavior with >>>>>>> >>>>> >> >> > client mode. For client mode, if we execute multiple times, then we will >>>>>>> >>>>> >> >> > deploy multiple Flink clusters for each execute. >>>>>>> >>>>> >> >> > I am not pretty sure whether it is reasonable. However, i still think using >>>>>>> >>>>> >> >> > the local client is a good choice. We could >>>>>>> >>>>> >> >> > continue the discussion in a new thread. @Zili Chen <[hidden email]> Do >>>>>>> >>>>> >> >> > you want to drive this? >>>>>>> >>>>> >> >> > >>>>>>> >>>>> >> >> > >>>>>>> >>>>> >> >> > >>>>>>> >>>>> >> >> > Best, >>>>>>> >>>>> >> >> > Yang >>>>>>> >>>>> >> >> > >>>>>>> >>>>> >> >> > Peter Huang <[hidden email]> 于2020年1月16日周四 上午1:55写道: >>>>>>> >>>>> >> >> > >>>>>>> >>>>> >> >> > > Hi Kostas, >>>>>>> >>>>> >> >> > > >>>>>>> >>>>> >> >> > > Thanks for this feedback. I can't agree more about the opinion. The >>>>>>> >>>>> >> >> > > cluster mode should be added >>>>>>> >>>>> >> >> > > first in per job cluster. >>>>>>> >>>>> >> >> > > >>>>>>> >>>>> >> >> > > 1) For job cluster implementation >>>>>>> >>>>> >> >> > > 1. Job graph recovery from configuration or store as static job graph as >>>>>>> >>>>> >> >> > > session cluster. I think the static one will be better for less recovery >>>>>>> >>>>> >> >> > > time. >>>>>>> >>>>> >> >> > > Let me update the doc for details. >>>>>>> >>>>> >> >> > > >>>>>>> >>>>> >> >> > > 2. For job execute multiple times, I think @Zili Chen >>>>>>> >>>>> >> >> > > <[hidden email]> has proposed the local client solution that can >>>>>>> >>>>> >> >> > > the run program actually in the cluster entry point. We can put the >>>>>>> >>>>> >> >> > > implementation in the second stage, >>>>>>> >>>>> >> >> > > or even a new FLIP for further discussion. >>>>>>> >>>>> >> >> > > >>>>>>> >>>>> >> >> > > 2) For session cluster implementation >>>>>>> >>>>> >> >> > > We can disable the cluster mode for the session cluster in the first >>>>>>> >>>>> >> >> > > stage. I agree the jar downloading will be a painful thing. >>>>>>> >>>>> >> >> > > We can consider about PoC and performance evaluation first. If the end to >>>>>>> >>>>> >> >> > > end experience is good enough, then we can consider >>>>>>> >>>>> >> >> > > proceeding with the solution. >>>>>>> >>>>> >> >> > > >>>>>>> >>>>> >> >> > > Looking forward to more opinions from @Yang Wang <[hidden email]> @Zili >>>>>>> >>>>> >> >> > > Chen <[hidden email]> @Dian Fu <[hidden email]>. 
Best Regards
Peter Huang

On Wed, Jan 15, 2020 at 7:50 AM Kostas Kloudas <[hidden email]> wrote:

Hi all,

I am writing here as the discussion on the Google Doc seems to be a bit difficult to follow.

I think that in order to be able to make progress, it would be helpful to focus on per-job mode for now. The reason is that:
1) making the (unique) JobSubmitHandler responsible for creating the jobgraphs, which includes downloading dependencies, is not an optimal solution
2) even if we put the responsibility on the JobMaster, currently each job has its own JobMaster but they all run in the same process, so we have again a single entity.

Of course after this is done, and if we feel comfortable with the solution, then we can go to the session mode.

A second comment has to do with fault tolerance in the per-job, cluster-deploy mode. In the document, it is suggested that upon recovery, the JobMaster of each job re-creates the JobGraph. I am just wondering if it is better to create and store the JobGraph upon submission and only fetch it upon recovery, so that we have a static JobGraph.

Finally, I have a question: what happens with jobs that have multiple execute() calls? The semantics seem to change compared to the current behaviour, right?

Cheers,
Kostas

On Wed, Jan 8, 2020 at 8:05 PM tison <[hidden email]> wrote:

Not always, Yang Wang is also not yet a committer but he can join the channel. I cannot find the id by clicking "Add new member in channel", so I came to you and asked you to try out the link. Possibly I will find other ways, but the original purpose is that the slack channel is a public area where we discuss development...

Best,
tison.

Peter Huang <[hidden email]> 于2020年1月9日周四 上午2:44写道:

Hi Tison,

I am not a committer of Flink yet. I think I can't join it either.

Best Regards
Peter Huang

On Wed, Jan 8, 2020 at 9:39 AM tison <[hidden email]> wrote:

Hi Peter,

Could you try out this link? https://the-asf.slack.com/messages/CNA3ADZPH

Best,
tison.

Peter Huang <[hidden email]> 于2020年1月9日周四 上午1:22写道:

Hi Tison,

I can't join the group with the shared link. Would you please add me to the group? My slack account is huangzhenqiu0825. Thank you in advance.

Best Regards
Peter Huang

On Wed, Jan 8, 2020 at 12:02 AM tison <[hidden email]> wrote:

Hi Peter,

As described above, this effort should get attention from the people developing FLIP-73, a.k.a. the Executor abstractions. I recommend you join the public slack channel[1] for Flink Client API Enhancement, where you can share your detailed thoughts. It will possibly get more concrete attention there.

Best,
tison.

[1] https://slack.com/share/IS21SJ75H/Rk8HhUly9FuEHb7oGwBZ33uL/enQtODg2MDYwNjE5MTg3LTA2MjIzNDc1M2ZjZDVlMjdlZjk1M2RkYmJhNjAwMTk2ZDZkODQ4NmY5YmI4OGRhNWJkYTViMTM1NzlmMzc4OWM

Peter Huang <[hidden email]> 于2020年1月7日周二 上午5:09写道:

Dear All,

Happy new year! According to existing feedback from the community, we revised the doc with the consideration of session cluster support, the concrete interface changes needed, and an execution plan. Please take one more round of review at your most convenient time.

https://docs.google.com/document/d/1aAwVjdZByA-0CHbgv16Me-vjaaDMCfhX7TzVVTuifYM/edit#

Best Regards
Peter Huang

On Thu, Jan 2, 2020 at 11:29 AM Peter Huang <[hidden email]> wrote:

Hi Dian,

Thanks for giving us valuable feedback.

1) It's better to have a whole design for this feature
For the suggestion of enabling the cluster mode also for session clusters, I think Flink already supports it. WebSubmissionExtension already allows users to start a job with a specified jar by using the web UI. But we need to enable the feature from the CLI for both local and remote jars. I will align with Yang Wang first about the details and update the design doc.

2) It's better to consider the convenience for users, such as debugging
I am wondering whether we can store the exceptions from job graph generation in the application master. As no streaming graph can be scheduled in this case, no more TMs will be requested from the Flink RM. If the AM is still running, users can still query it from the CLI. As it requires more change, we can get some feedback from <[hidden email]> and @[hidden email] <[hidden email]>.

3) It's better to consider the impact to the stability of the cluster
I agree with Yang Wang's opinion.

Best Regards
Peter Huang

On Sun, Dec 29, 2019 at 9:44 PM Dian Fu <[hidden email]> wrote:

Hi all,

Sorry to jump into this discussion. Thanks everyone for the discussion. I'm very interested in this topic although I'm not an expert in this part. So I'm glad to share my thoughts as follows:

1) It's better to have a whole design for this feature
As we know, there are two deployment modes: per-job mode and session mode. I'm wondering which mode really needs this feature. As the design doc mentioned, per-job mode is more used for streaming jobs and session mode is usually used for batch jobs (of course, the job types and the deployment modes are orthogonal). Usually a streaming job only needs to be submitted once and it will run for days or weeks, while batch jobs are submitted more frequently compared with streaming jobs. This means that maybe session mode also needs this feature. However, if we support this feature in session mode, the application master will become the new centralized service (which should be solved). So in this case, it's better to have a complete design for both per-job mode and session mode. Furthermore, even if we can do it phase by phase, we need to have a whole picture of how it works in both per-job mode and session mode.

2) It's better to consider the convenience for users, such as debugging
After we finish this feature, the job graph will be compiled in the application master, which means that users cannot easily get the exception message synchronously in the job client if there are problems during the job graph compiling (especially for platform users), such as the resource path being incorrect, the user program itself having some problems, etc. What I'm thinking is that maybe we should throw the exceptions as early as possible (during the job submission stage).

3) It's better to consider the impact to the stability of the cluster
If we perform the compiling in the application master, we should consider the impact of compiling errors. Although YARN could resume the application master in case of failures, in some cases the compiling failure may be a waste of cluster resources and may impact the stability of the cluster and the other jobs in the cluster, for example when the resource path is incorrect or the user program itself has some problems (in this case, job failover cannot solve this kind of problem), etc. In the current implementation, the compiling errors are handled on the client side and there is no impact to the cluster at all.

Regarding 1), it's clearly pointed out in the design doc that only per-job mode will be supported. However, I think it's better to also consider the session mode in the design doc.
Regarding 2) and 3), I have not seen related sections in the design doc. It would be good if we can cover them in the design doc.

Feel free to correct me if there is anything I misunderstand.

Regards,
Dian

在 2019年12月27日, 上午3:13, Peter Huang <[hidden email]> 写道:

Hi Yang,

I can't agree more. The effort definitely needs to align with the final goal of FLIP-73. I am thinking about whether we can achieve the goal in two phases.

1) Phase I
As the CliFrontend will not be deprecated soon, we can still use the deployMode flag there, pass the program info through the Flink configuration, and use the ClassPathJobGraphRetriever to generate the job graph in the ClusterEntrypoints of Yarn and Kubernetes.

2) Phase II
In AbstractJobClusterExecutor, the job graph is generated in the execute function. We can still use the deployMode in it. With deployMode = cluster, the execute function only starts the cluster.

When {Yarn/Kubernetes}PerJobClusterEntrypoint starts, it will start the dispatcher first; then we can use a ClusterEnvironment similar to ContextEnvironment to submit the job by jobName to the local dispatcher. For the details, we need more investigation. Let's wait for @Aljoscha Krettek <[hidden email]> and @Till Rohrmann <[hidden email]>'s feedback after the holiday season.

Thank you in advance. Merry Christmas and Happy New Year!!!

Best Regards
Peter Huang

On Wed, Dec 25, 2019 at 1:08 AM Yang Wang <[hidden email]> wrote:

Hi Peter,

I think we need to reconsider tison's suggestion seriously. After FLIP-73, the deployJobCluster has been moved into `JobClusterExecutor#execute`. It should not be perceived by `CliFrontend`. That means the user program will *ALWAYS* be executed on the client side. This is the by-design behavior. So, we could not just add `if (client mode) .. else if (cluster mode) ...` code in `CliFrontend` to bypass the executor. We need to find a clean way to decouple executing the user program from deploying the per-job cluster. Based on this, we could support executing the user program on the client or the master side.

Maybe Aljoscha and Jeff could give some good suggestions.

Best,
Yang

Peter Huang <[hidden email]> 于2019年12月25日周三 上午4:03写道:

Hi Jingjing,

The improvement proposed is a deployment option for the CLI. For SQL-based Flink applications, it is more convenient to use the existing model in SqlClient, in which the job graph is generated within SqlClient. After adding the delayed job graph generation, I think no change is needed on your side.

Best Regards
Peter Huang

On Wed, Dec 18, 2019 at 6:01 AM jingjing bai <[hidden email]> wrote:

hi peter:
    we had extended SqlClient to support SQL job submission from a web UI, based on Flink 1.9. We support submitting to Yarn in per-job mode too. In this case, the job graph is generated on the client side. I think this discussion is mainly about improving API programs, but in my case there is no jar to upload, only a SQL string. Do you have more suggestions to improve the SQL mode, or is it only a switch for API programs?

best
bai jj

Yang Wang <[hidden email]> 于2019年12月18日周三 下午7:21写道:

I just want to revive this discussion.

Recently, I am thinking about how to natively run a Flink per-job cluster on Kubernetes. The per-job mode on Kubernetes is very different from on Yarn. And we will have the same deployment requirements for the client and entry point.

1. The Flink client does not always need a local jar to start a Flink per-job cluster. We could support multiple schemes. For example, file:///path/of/my.jar means a jar located at the client side, hdfs://myhdfs/user/myname/flink/my.jar means a jar located on a remote HDFS, and local:///path/in/image/my.jar means a jar located at the jobmanager side.

2. Support running the user program on the master side. This also means the entry point will generate the job graph on the master side. We could use the ClasspathJobGraphRetriever or start a local Flink client to achieve this purpose.

cc tison, Aljoscha & Kostas: do you think this is the right direction we need to work on?

tison <[hidden email]> 于2019年12月12日周四 下午4:48写道:

A quick idea is that we separate the deployment from the user program, as it has always been done outside the program. When the user program is executed there is always a ClusterClient that communicates with an existing cluster, remote or local. It will be another thread, so just for your information.

Best,
tison.

tison <[hidden email]> 于2019年12月12日周四 下午4:40写道:

Hi Peter,

Another concern I realized recently is that with the current Executors abstraction (FLIP-73) I'm afraid that the user program is designed to ALWAYS run on the client side. Specifically, we deploy the job in the executor when env.execute() is called. This abstraction possibly prevents Flink from running the user program on the cluster side.

For your proposal, in this case we have already compiled the program and run it on the client side; even if we deploy a cluster and retrieve the job graph from program metadata, it doesn't make much sense.

cc Aljoscha & Kostas: what do you think about this constraint?

Best,
tison.

Peter Huang <[hidden email]> 于2019年12月10日周二 下午12:45写道:

Hi Tison,

Yes, you are right. I think I made the wrong argument in the doc. Basically, the packaging jar problem is only for platform users. In our internal deploy service, we further optimized the deployment latency by letting users package flink-runtime together with the uber jar, so that we don't need to consider multiple Flink version support for now. In the session client mode, as the Flink libs will be shipped anyway as local resources of Yarn, users actually don't need to package those libs into the job jar.

Best Regards
Peter Huang

On Mon, Dec 9, 2019 at 8:35 PM tison <[hidden email]> wrote:

> 3. What do you mean about the package? Do users need to compile their jars including flink-clients, flink-optimizer, flink-table code?

The answer should be no, because they exist in the system classpath.

Best,
tison.

Yang Wang <[hidden email]> 于2019年12月10日周二 下午12:18写道:

Hi Peter,

Thanks a lot for starting this discussion. I think this is a very useful feature.

Not only for Yarn: I am focused on the Flink on Kubernetes integration and came across the same problem. I do not want the job graph generated on the client side. Instead, the user jars are built into a user-defined image. When the job manager is launched, we just need to generate the job graph based on the local user jars.

I have some small suggestions about this.

1. `ProgramJobGraphRetriever` is very similar to `ClasspathJobGraphRetriever`; the differences are that the former needs `ProgramMetadata` and the latter needs some arguments. Is it possible to have a unified `JobGraphRetriever` to support both?
2. Is it possible to not use a local user jar to start a per-job cluster? In your case, the user jars exist on HDFS already and we do need to download the jars to the deployer service. Currently, we always need a local user jar to start a Flink cluster. It would be great if we could support remote user jars.
> In the implementation, we assume users package flink-clients, flink-optimizer, flink-table together within the job jar. Otherwise, the job graph generation within JobClusterEntryPoint will fail.
3. What do you mean about the package? Do users need to compile their jars including flink-clients, flink-optimizer, flink-table code?

Best,
Yang

Peter Huang <[hidden email]> 于2019年12月10日周二 上午2:37写道:

Dear All,

Recently, the Flink community started to improve the Yarn cluster descriptor to make the job jar and config files configurable from the CLI. It improves the flexibility of Flink deployment in Yarn per-job mode. For platform users who manage tens of hundreds of streaming pipelines for a whole org or company, we found that job graph generation on the client side is another pain point. Thus, we want to propose a configurable feature for FlinkYarnSessionCli. The feature allows users to choose job graph generation in the Flink ClusterEntryPoint so that the job jar doesn't need to be local for the job graph generation. The proposal is organized as a FLIP:

https://cwiki.apache.org/confluence/display/FLINK/FLIP-85+Delayed+JobGraph+Generation

Any questions and suggestions are welcome. Thank you in advance.

Best Regards
Peter Huang
> For the -R flag, this was in the PoC that I published just as a quick
> implementation, so that I can move fast to the entrypoint part.
> Personally, I would not even be against having a separate command in
> the CLI for this, sth like run-on-cluster or something along those lines.
> What do you think?

I would be in favour of something like "bin/flink run-application", maybe we should even have "run-job" in the future to differentiate.

> For fetching jars, in the FLIP we say that as a first implementation
> we can have Local and DFS. I was wondering if in the case of YARN,
> both could be somehow implemented using LocalResources, and let Yarn
> do the actual fetch. But I have not investigated it further. Do you
> have any opinion on this?

By now I'm 99 % sure that we should use YARN for that, i.e. use LocalResource. Then YARN does the fetching. This is also how the current per-job cluster deployment does it: the Flink code uploads local files to (H)DFS and then sets the remote paths as a local resource that the entrypoint then uses.

Best,
Aljoscha
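[As an illustration of the YARN mechanism referred to here, a minimal sketch using the plain Hadoop YARN client API follows. It assumes Hadoop 2.x classes on the classpath; the class name UserJarLocalResource and the "user-job.jar" key are placeholders, and this is not Flink's actual upload code.]

import java.io.IOException;
import java.util.Collections;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.api.records.LocalResourceType;
import org.apache.hadoop.yarn.api.records.LocalResourceVisibility;
import org.apache.hadoop.yarn.util.ConverterUtils;
import org.apache.hadoop.yarn.util.Records;

public final class UserJarLocalResource {

    /** Describes a remote jar as a LocalResource so that YARN, not Flink, does the fetching. */
    public static Map<String, LocalResource> forRemoteJar(Configuration hadoopConf, Path remoteJar)
            throws IOException {
        FileStatus status = remoteJar.getFileSystem(hadoopConf).getFileStatus(remoteJar);

        LocalResource resource = Records.newRecord(LocalResource.class);
        resource.setResource(ConverterUtils.getYarnUrlFromPath(remoteJar));
        resource.setSize(status.getLen());
        resource.setTimestamp(status.getModificationTime());
        resource.setType(LocalResourceType.FILE);
        resource.setVisibility(LocalResourceVisibility.APPLICATION);

        // The key is the file name the jar gets in the container's working directory; the map
        // would be passed to the ContainerLaunchContext via setLocalResources(...).
        return Collections.singletonMap("user-job.jar", resource);
    }
}

[Usage would be something like forRemoteJar(conf, new Path("hdfs://myhdfs/user/myname/flink/my.jar")); YARN then localizes the jar next to the entrypoint before it starts.]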
Hi Aljoscha, Kostas,
> I would be in favour of something like "bin/flink run-application",
> maybe we should even have "run-job" in the future to differentiate.

I have no preference for the "-R/--remote-deploy" option of "flink run" or the newly introduced "flink run-application". If we always bind the "application mode" to "run-main-on-cluster", I think both of them make sense to me.

For the "run-job", do you mean to submit a Flink job to an existing session, or, just like the current per-job mode, to start a dedicated Flink cluster? Then will "flink run" be deprecated?

How to fetch the jars and dependencies? On a Yarn deployment, we could register the local or HDFS jars/files as LocalResources and let Yarn localize the resources to the working directory; when the entrypoint is launched, all the jars and dependencies already exist locally. So the entrypoint will *NOT* do the real fetching, do I understand correctly?

If this is the case, for a K8s deployment the jars need to be built into the image or fetched by an init-container. Then the following code path will be exactly the same as on Yarn.

Best,
Yang

Aljoscha Krettek <[hidden email]> 于2020年3月9日周一 下午9:55写道:

> > For the -R flag, this was in the PoC that I published just as a quick
> > implementation, so that I can move fast to the entrypoint part.
> > Personally, I would not even be against having a separate command in
> > the CLI for this, sth like run-on-cluster or something along those lines.
> > What do you think?
>
> I would be in favour of something like "bin/flink run-application",
> maybe we should even have "run-job" in the future to differentiate.
>
> > For fetching jars, in the FLIP we say that as a first implementation
> > we can have Local and DFS. I was wondering if in the case of YARN,
> > both could be somehow implemented using LocalResources, and let Yarn
> > do the actual fetch. But I have not investigated it further. Do you
> > have any opinion on this?
>
> By now I'm 99 % sure that we should use YARN for that, i.e. use
> LocalResource. Then YARN does the fetching. This is also how the current
> per-job cluster deployment does it, the Flink code uploads local files
> to (H)DFS and then sets the remote paths as a local resource that the
> entrypoint then uses.
>
> Best,
> Aljoscha
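[As a rough illustration of the jar-location schemes described above, here is a small hypothetical helper; the JarLocation class and resolve() method are not Flink API, just a sketch of dispatching on the URI scheme to decide who fetches the jar.]

import java.net.URI;

public final class JarLocation {

    public enum Kind { CLIENT_LOCAL, REMOTE_DFS, IMAGE_LOCAL }

    /** Classifies where the user jar lives so the deployment can decide who fetches it. */
    public static Kind resolve(String userJar) {
        String scheme = URI.create(userJar).getScheme();
        if (scheme == null || "file".equals(scheme)) {
            return Kind.CLIENT_LOCAL;   // e.g. file:///path/of/my.jar, shipped from the client
        } else if ("local".equals(scheme)) {
            return Kind.IMAGE_LOCAL;    // e.g. local:///path/in/image/my.jar, already in the image
        } else {
            return Kind.REMOTE_DFS;     // e.g. hdfs://myhdfs/user/myname/flink/my.jar, localized by YARN
        }
    }
}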
On 10.03.20 03:31, Yang Wang wrote:
> For the "run-job", do you mean to submit a Flink job to an existing session > or > just like the current per-job to start a dedicated Flink cluster? Then will > "flink run" be deprecated? I was talking about the per-job mode that starts a dedicated Flink cluster. This was more thinking about the future but it might make sense to separate these modes more. "flink run" would then only be used for submitting to a session cluster, on standalone or K8s or whatnot. > On Yarn deployment, we could register the local or HDFS jar/files > as LocalResource. > And let Yarn to localize the resource to workdir, when the entrypoint is > launched, all > the jars and dependencies exist locally. So the entrypoint will *NOT* do > the real fetching, > do i understand correctly? Yes, this is exactly what I meant. Best, Aljoscha |
Thanks for your response.
@Kostas Kloudas <[hidden email]> Could we update the CLI changes and how the user jars are fetched in the FLIP document? I think other devs or users may have similar questions.

Best,
Yang

Aljoscha Krettek <[hidden email]> 于2020年3月10日周二 下午9:03写道:

> On 10.03.20 03:31, Yang Wang wrote:
> > For the "run-job", do you mean to submit a Flink job to an existing session
> > or just like the current per-job to start a dedicated Flink cluster? Then will
> > "flink run" be deprecated?
>
> I was talking about the per-job mode that starts a dedicated Flink
> cluster. This was more thinking about the future but it might make sense
> to separate these modes more. "flink run" would then only be used for
> submitting to a session cluster, on standalone or K8s or whatnot.
>
> > On Yarn deployment, we could register the local or HDFS jar/files as LocalResource.
> > And let Yarn to localize the resource to workdir, when the entrypoint is launched,
> > all the jars and dependencies exist locally. So the entrypoint will *NOT* do
> > the real fetching, do i understand correctly?
>
> Yes, this is exactly what I meant.
>
> Best,
> Aljoscha
Hi all,
Yes, I will do that. From the discussion, I will add that:
1) for the CLI, we are planning to add a "run-application" command
2) for deployment on Yarn, we are planning to use LocalResources to let Yarn do the jar transfer
3) for standalone/containers, we assume that dependencies/jars are built into the image.

Cheers,
Kostas

On Tue, Mar 10, 2020 at 3:05 PM Yang Wang <[hidden email]> wrote:
>
> Thanks for your response.
>
> @Kostas Kloudas Could we update the cli changes and how to fetch the
> user jars to FLIP document? I think other dev or users may have the similar questions.
>
> Best,
> Yang
>
> Aljoscha Krettek <[hidden email]> 于2020年3月10日周二 下午9:03写道:
>>
>> On 10.03.20 03:31, Yang Wang wrote:
>> > For the "run-job", do you mean to submit a Flink job to an existing session
>> > or just like the current per-job to start a dedicated Flink cluster? Then will
>> > "flink run" be deprecated?
>>
>> I was talking about the per-job mode that starts a dedicated Flink
>> cluster. This was more thinking about the future but it might make sense
>> to separate these modes more. "flink run" would then only be used for
>> submitting to a session cluster, on standalone or K8s or whatnot.
>>
>> > On Yarn deployment, we could register the local or HDFS jar/files as LocalResource.
>> > And let Yarn to localize the resource to workdir, when the entrypoint is launched,
>> > all the jars and dependencies exist locally. So the entrypoint will *NOT* do
>> > the real fetching, do i understand correctly?
>>
>> Yes, this is exactly what I meant.
>>
>> Best,
>> Aljoscha
Hi all,
The FLIP was updated under the section "First Version Deliverables".

Cheers,
Kostas

On Tue, Mar 10, 2020 at 4:10 PM Kostas Kloudas <[hidden email]> wrote:
>
> Hi all,
>
> Yes I will do that. From the discussion, I will add that:
> 1) for the cli, we are planning to add a "run-application" command
> 2) for deployment in Yarn we are planning to use LocalResources to let Yarn do the jar transfer
> 3) for Standalone/containers, we assume that dependencies/jars are built into the image.
>
> Cheers,
> Kostas
>
> On Tue, Mar 10, 2020 at 3:05 PM Yang Wang <[hidden email]> wrote:
> >
> > Thanks for your response.
> >
> > @Kostas Kloudas Could we update the cli changes and how to fetch the
> > user jars to FLIP document? I think other dev or users may have the similar questions.
> >
> > Best,
> > Yang
> >
> > Aljoscha Krettek <[hidden email]> 于2020年3月10日周二 下午9:03写道:
> >>
> >> On 10.03.20 03:31, Yang Wang wrote:
> >> > For the "run-job", do you mean to submit a Flink job to an existing session
> >> > or just like the current per-job to start a dedicated Flink cluster? Then will
> >> > "flink run" be deprecated?
> >>
> >> I was talking about the per-job mode that starts a dedicated Flink
> >> cluster. This was more thinking about the future but it might make sense
> >> to separate these modes more. "flink run" would then only be used for
> >> submitting to a session cluster, on standalone or K8s or whatnot.
> >>
> >> > On Yarn deployment, we could register the local or HDFS jar/files as LocalResource.
> >> > And let Yarn to localize the resource to workdir, when the entrypoint is launched,
> >> > all the jars and dependencies exist locally. So the entrypoint will *NOT* do
> >> > the real fetching, do i understand correctly?
> >>
> >> Yes, this is exactly what I meant.
> >>
> >> Best,
> >> Aljoscha
Thanks for your update Klou!
Best,
tison.

Kostas Kloudas <[hidden email]> 于2020年3月11日周三 上午2:05写道:

> Hi all,
>
> The FLIP was updated under the section "First Version Deliverables".
>
> Cheers,
> Kostas
>
> On Tue, Mar 10, 2020 at 4:10 PM Kostas Kloudas <[hidden email]> wrote:
> >
> > Hi all,
> >
> > Yes I will do that. From the discussion, I will add that:
> > 1) for the cli, we are planning to add a "run-application" command
> > 2) for deployment in Yarn we are planning to use LocalResources to let Yarn do the jar transfer
> > 3) for Standalone/containers, we assume that dependencies/jars are built into the image.
> >
> > Cheers,
> > Kostas
> >
> > On Tue, Mar 10, 2020 at 3:05 PM Yang Wang <[hidden email]> wrote:
> > >
> > > Thanks for your response.
> > >
> > > @Kostas Kloudas Could we update the cli changes and how to fetch the
> > > user jars to FLIP document? I think other dev or users may have the similar questions.
> > >
> > > Best,
> > > Yang
> > >
> > > Aljoscha Krettek <[hidden email]> 于2020年3月10日周二 下午9:03写道:
> > >>
> > >> On 10.03.20 03:31, Yang Wang wrote:
> > >> > For the "run-job", do you mean to submit a Flink job to an existing session
> > >> > or just like the current per-job to start a dedicated Flink cluster? Then will
> > >> > "flink run" be deprecated?
> > >>
> > >> I was talking about the per-job mode that starts a dedicated Flink
> > >> cluster. This was more thinking about the future but it might make sense
> > >> to separate these modes more. "flink run" would then only be used for
> > >> submitting to a session cluster, on standalone or K8s or whatnot.
> > >>
> > >> > On Yarn deployment, we could register the local or HDFS jar/files as LocalResource.
> > >> > And let Yarn to localize the resource to workdir, when the entrypoint is launched,
> > >> > all the jars and dependencies exist locally. So the entrypoint will *NOT* do
> > >> > the real fetching, do i understand correctly?
> > >>
> > >> Yes, this is exactly what I meant.
> > >>
> > >> Best,
> > >> Aljoscha