[DISCUSS] FLIP-85: Delayed Job Graph Generation

classic Classic list List threaded Threaded
54 messages Options
123
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-85: Delayed Job Graph Generation

Peter Huang
Hi Dian,
Thanks for giving us valuable feedbacks.

1) It's better to have a whole design for this feature
For the suggestion of enabling the cluster mode also session cluster, I
think Flink already supported it. WebSubmissionExtension already allows
users to start a job with the specified jar by using web UI.
But we need to enable the feature from CLI for both local jar, remote jar.
I will align with Yang Wang first about the details and update the design
doc.

2) It's better to consider the convenience for users, such as debugging

I am wondering whether we can store the exception in jobgragh generation in
application master. As no streaming graph can be scheduled in this case,
there will be no more TM will be requested from FlinkRM.
If the AM is still running, users can still query it from CLI. As it
requires more change, we can get some feedback from <[hidden email]>
and @[hidden email] <[hidden email]>.

3) It's better to consider the impact to the stability of the cluster

I agree with Yang Wang's opinion.



Best Regards
Peter Huang


On Sun, Dec 29, 2019 at 9:44 PM Dian Fu <[hidden email]> wrote:

> Hi all,
>
> Sorry to jump into this discussion. Thanks everyone for the discussion.
> I'm very interested in this topic although I'm not an expert in this part.
> So I'm glad to share my thoughts as following:
>
> 1) It's better to have a whole design for this feature
> As we know, there are two deployment modes: per-job mode and session mode.
> I'm wondering which mode really needs this feature. As the design doc
> mentioned, per-job mode is more used for streaming jobs and session mode is
> usually used for batch jobs(Of course, the job types and the deployment
> modes are orthogonal). Usually streaming job is only needed to be submitted
> once and it will run for days or weeks, while batch jobs will be submitted
> more frequently compared with streaming jobs. This means that maybe session
> mode also needs this feature. However, if we support this feature in
> session mode, the application master will become the new centralized
> service(which should be solved). So in this case, it's better to have a
> complete design for both per-job mode and session mode. Furthermore, even
> if we can do it phase by phase, we need to have a whole picture of how it
> works in both per-job mode and session mode.
>
> 2) It's better to consider the convenience for users, such as debugging
> After we finish this feature, the job graph will be compiled in the
> application master, which means that users cannot easily get the exception
> message synchorousely in the job client if there are problems during the
> job graph compiling (especially for platform users), such as the resource
> path is incorrect, the user program itself has some problems, etc. What I'm
> thinking is that maybe we should throw the exceptions as early as possible
> (during job submission stage).
>
> 3) It's better to consider the impact to the stability of the cluster
> If we perform the compiling in the application master, we should consider
> the impact of the compiling errors. Although YARN could resume the
> application master in case of failures, but in some case the compiling
> failure may be a waste of cluster resource and may impact the stability the
> cluster and the other jobs in the cluster, such as the resource path is
> incorrect, the user program itself has some problems(in this case, job
> failover cannot solve this kind of problems) etc. In the current
> implemention, the compiling errors are handled in the client side and there
> is no impact to the cluster at all.
>
> Regarding to 1), it's clearly pointed in the design doc that only per-job
> mode will be supported. However, I think it's better to also consider the
> session mode in the design doc.
> Regarding to 2) and 3), I have not seen related sections in the design
> doc. It will be good if we can cover them in the design doc.
>
> Feel free to correct me If there is anything I misunderstand.
>
> Regards,
> Dian
>
>
> > 在 2019年12月27日,上午3:13,Peter Huang <[hidden email]> 写道:
> >
> > Hi Yang,
> >
> > I can't agree more. The effort definitely needs to align with the final
> > goal of FLIP-73.
> > I am thinking about whether we can achieve the goal with two phases.
> >
> > 1) Phase I
> > As the CLiFrontend will not be depreciated soon. We can still use the
> > deployMode flag there,
> > pass the program info through Flink configuration,  use the
> > ClassPathJobGraphRetriever
> > to generate the job graph in ClusterEntrypoints of yarn and Kubernetes.
> >
> > 2) Phase II
> > In  AbstractJobClusterExecutor, the job graph is generated in the execute
> > function. We can still
> > use the deployMode in it. With deployMode = cluster, the execute function
> > only starts the cluster.
> >
> > When {Yarn/Kuberneates}PerJobClusterEntrypoint starts, It will start the
> > dispatch first, then we can use
> > a ClusterEnvironment similar to ContextEnvironment to submit the job with
> > jobName the local
> > dispatcher. For the details, we need more investigation. Let's wait
> > for @Aljoscha
> > Krettek <[hidden email]> @Till Rohrmann <[hidden email]>'s
> > feedback after the holiday season.
> >
> > Thank you in advance. Merry Chrismas and Happy New Year!!!
> >
> >
> > Best Regards
> > Peter Huang
> >
> >
> >
> >
> >
> >
> >
> >
> > On Wed, Dec 25, 2019 at 1:08 AM Yang Wang <[hidden email]> wrote:
> >
> >> Hi Peter,
> >>
> >> I think we need to reconsider tison's suggestion seriously. After
> FLIP-73,
> >> the deployJobCluster has
> >> beenmoved into `JobClusterExecutor#execute`. It should not be perceived
> >> for `CliFrontend`. That
> >> means the user program will *ALWAYS* be executed on client side. This is
> >> the by design behavior.
> >> So, we could not just add `if(client mode) .. else if(cluster mode) ...`
> >> codes in `CliFrontend` to bypass
> >> the executor. We need to find a clean way to decouple executing user
> >> program and deploying per-job
> >> cluster. Based on this, we could support to execute user program on
> client
> >> or master side.
> >>
> >> Maybe Aljoscha and Jeff could give some good suggestions.
> >>
> >>
> >>
> >> Best,
> >> Yang
> >>
> >> Peter Huang <[hidden email]> 于2019年12月25日周三 上午4:03写道:
> >>
> >>> Hi Jingjing,
> >>>
> >>> The improvement proposed is a deployment option for CLI. For SQL based
> >>> Flink application, It is more convenient to use the existing model in
> >>> SqlClient in which
> >>> the job graph is generated within SqlClient. After adding the delayed
> job
> >>> graph generation, I think there is no change is needed for your side.
> >>>
> >>>
> >>> Best Regards
> >>> Peter Huang
> >>>
> >>>
> >>> On Wed, Dec 18, 2019 at 6:01 AM jingjing bai <
> [hidden email]>
> >>> wrote:
> >>>
> >>>> hi peter:
> >>>>    we had extension SqlClent to support sql job submit in web base on
> >>>> flink 1.9.   we support submit to yarn on per job mode too.
> >>>>    in this case, the job graph generated  on client side .  I think
> >>> this
> >>>> discuss Mainly to improve api programme.  but in my case , there is no
> >>>> jar to upload but only a sql string .
> >>>>    do u had more suggestion to improve for sql mode or it is only a
> >>>> switch for api programme?
> >>>>
> >>>>
> >>>> best
> >>>> bai jj
> >>>>
> >>>>
> >>>> Yang Wang <[hidden email]> 于2019年12月18日周三 下午7:21写道:
> >>>>
> >>>>> I just want to revive this discussion.
> >>>>>
> >>>>> Recently, i am thinking about how to natively run flink per-job
> >>> cluster on
> >>>>> Kubernetes.
> >>>>> The per-job mode on Kubernetes is very different from on Yarn. And we
> >>> will
> >>>>> have
> >>>>> the same deployment requirements to the client and entry point.
> >>>>>
> >>>>> 1. Flink client not always need a local jar to start a Flink per-job
> >>>>> cluster. We could
> >>>>> support multiple schemas. For example, file:///path/of/my.jar means a
> >>> jar
> >>>>> located
> >>>>> at client side, hdfs://myhdfs/user/myname/flink/my.jar means a jar
> >>> located
> >>>>> at
> >>>>> remote hdfs, local:///path/in/image/my.jar means a jar located at
> >>>>> jobmanager side.
> >>>>>
> >>>>> 2. Support running user program on master side. This also means the
> >>> entry
> >>>>> point
> >>>>> will generate the job graph on master side. We could use the
> >>>>> ClasspathJobGraphRetriever
> >>>>> or start a local Flink client to achieve this purpose.
> >>>>>
> >>>>>
> >>>>> cc tison, Aljoscha & Kostas Do you think this is the right direction
> we
> >>>>> need to work?
> >>>>>
> >>>>> tison <[hidden email]> 于2019年12月12日周四 下午4:48写道:
> >>>>>
> >>>>>> A quick idea is that we separate the deployment from user program
> >>> that
> >>>>> it
> >>>>>> has always been done
> >>>>>> outside the program. On user program executed there is always a
> >>>>>> ClusterClient that communicates with
> >>>>>> an existing cluster, remote or local. It will be another thread so
> >>> just
> >>>>> for
> >>>>>> your information.
> >>>>>>
> >>>>>> Best,
> >>>>>> tison.
> >>>>>>
> >>>>>>
> >>>>>> tison <[hidden email]> 于2019年12月12日周四 下午4:40写道:
> >>>>>>
> >>>>>>> Hi Peter,
> >>>>>>>
> >>>>>>> Another concern I realized recently is that with current Executors
> >>>>>>> abstraction(FLIP-73)
> >>>>>>> I'm afraid that user program is designed to ALWAYS run on the
> >>> client
> >>>>>> side.
> >>>>>>> Specifically,
> >>>>>>> we deploy the job in executor when env.execute called. This
> >>>>> abstraction
> >>>>>>> possibly prevents
> >>>>>>> Flink runs user program on the cluster side.
> >>>>>>>
> >>>>>>> For your proposal, in this case we already compiled the program and
> >>>>> run
> >>>>>> on
> >>>>>>> the client side,
> >>>>>>> even we deploy a cluster and retrieve job graph from program
> >>>>> metadata, it
> >>>>>>> doesn't make
> >>>>>>> many sense.
> >>>>>>>
> >>>>>>> cc Aljoscha & Kostas what do you think about this constraint?
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> tison.
> >>>>>>>
> >>>>>>>
> >>>>>>> Peter Huang <[hidden email]> 于2019年12月10日周二 下午12:45写道:
> >>>>>>>
> >>>>>>>> Hi Tison,
> >>>>>>>>
> >>>>>>>> Yes, you are right. I think I made the wrong argument in the doc.
> >>>>>>>> Basically, the packaging jar problem is only for platform users.
> >>> In
> >>>>> our
> >>>>>>>> internal deploy service,
> >>>>>>>> we further optimized the deployment latency by letting users to
> >>>>>> packaging
> >>>>>>>> flink-runtime together with the uber jar, so that we don't need to
> >>>>>>>> consider
> >>>>>>>> multiple flink version
> >>>>>>>> support for now. In the session client mode, as Flink libs will be
> >>>>>> shipped
> >>>>>>>> anyway as local resources of yarn. Users actually don't need to
> >>>>> package
> >>>>>>>> those libs into job jar.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Best Regards
> >>>>>>>> Peter Huang
> >>>>>>>>
> >>>>>>>> On Mon, Dec 9, 2019 at 8:35 PM tison <[hidden email]>
> >>> wrote:
> >>>>>>>>
> >>>>>>>>>> 3. What do you mean about the package? Do users need to
> >>> compile
> >>>>>> their
> >>>>>>>>> jars
> >>>>>>>>> inlcuding flink-clients, flink-optimizer, flink-table codes?
> >>>>>>>>>
> >>>>>>>>> The answer should be no because they exist in system classpath.
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>> tison.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Yang Wang <[hidden email]> 于2019年12月10日周二 下午12:18写道:
> >>>>>>>>>
> >>>>>>>>>> Hi Peter,
> >>>>>>>>>>
> >>>>>>>>>> Thanks a lot for starting this discussion. I think this is a
> >>> very
> >>>>>>>> useful
> >>>>>>>>>> feature.
> >>>>>>>>>>
> >>>>>>>>>> Not only for Yarn, i am focused on flink on Kubernetes
> >>>>> integration
> >>>>>> and
> >>>>>>>>> come
> >>>>>>>>>> across the same
> >>>>>>>>>> problem. I do not want the job graph generated on client side.
> >>>>>>>> Instead,
> >>>>>>>>> the
> >>>>>>>>>> user jars are built in
> >>>>>>>>>> a user-defined image. When the job manager launched, we just
> >>>>> need to
> >>>>>>>>>> generate the job graph
> >>>>>>>>>> based on local user jars.
> >>>>>>>>>>
> >>>>>>>>>> I have some small suggestion about this.
> >>>>>>>>>>
> >>>>>>>>>> 1. `ProgramJobGraphRetriever` is very similar to
> >>>>>>>>>> `ClasspathJobGraphRetriever`, the differences
> >>>>>>>>>> are the former needs `ProgramMetadata` and the latter needs
> >>> some
> >>>>>>>>> arguments.
> >>>>>>>>>> Is it possible to
> >>>>>>>>>> have an unified `JobGraphRetriever` to support both?
> >>>>>>>>>> 2. Is it possible to not use a local user jar to start a
> >>> per-job
> >>>>>>>> cluster?
> >>>>>>>>>> In your case, the user jars has
> >>>>>>>>>> existed on hdfs already and we do need to download the jars to
> >>>>>>>> deployer
> >>>>>>>>>> service. Currently, we
> >>>>>>>>>> always need a local user jar to start a flink cluster. It is
> >>> be
> >>>>>> great
> >>>>>>>> if
> >>>>>>>>> we
> >>>>>>>>>> could support remote user jars.
> >>>>>>>>>>>> In the implementation, we assume users package
> >>> flink-clients,
> >>>>>>>>>> flink-optimizer, flink-table together within the job jar.
> >>>>> Otherwise,
> >>>>>>>> the
> >>>>>>>>>> job graph generation within JobClusterEntryPoint will fail.
> >>>>>>>>>> 3. What do you mean about the package? Do users need to
> >>> compile
> >>>>>> their
> >>>>>>>>> jars
> >>>>>>>>>> inlcuding flink-clients, flink-optimizer, flink-table codes?
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Yang
> >>>>>>>>>>
> >>>>>>>>>> Peter Huang <[hidden email]> 于2019年12月10日周二
> >>>>> 上午2:37写道:
> >>>>>>>>>>
> >>>>>>>>>>> Dear All,
> >>>>>>>>>>>
> >>>>>>>>>>> Recently, the Flink community starts to improve the yarn
> >>>>> cluster
> >>>>>>>>>> descriptor
> >>>>>>>>>>> to make job jar and config files configurable from CLI. It
> >>>>>> improves
> >>>>>>>> the
> >>>>>>>>>>> flexibility of  Flink deployment Yarn Per Job Mode. For
> >>>>> platform
> >>>>>>>> users
> >>>>>>>>>> who
> >>>>>>>>>>> manage tens of hundreds of streaming pipelines for the whole
> >>>>> org
> >>>>>> or
> >>>>>>>>>>> company, we found the job graph generation in client-side is
> >>>>>> another
> >>>>>>>>>>> pinpoint. Thus, we want to propose a configurable feature
> >>> for
> >>>>>>>>>>> FlinkYarnSessionCli. The feature can allow users to choose
> >>> the
> >>>>> job
> >>>>>>>>> graph
> >>>>>>>>>>> generation in Flink ClusterEntryPoint so that the job jar
> >>>>> doesn't
> >>>>>>>> need
> >>>>>>>>> to
> >>>>>>>>>>> be locally for the job graph generation. The proposal is
> >>>>> organized
> >>>>>>>> as a
> >>>>>>>>>>> FLIP
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-85+Delayed+JobGraph+Generation
> >>>>>>>>>>> .
> >>>>>>>>>>>
> >>>>>>>>>>> Any questions and suggestions are welcomed. Thank you in
> >>>>> advance.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Best Regards
> >>>>>>>>>>> Peter Huang
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
>
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-85: Delayed Job Graph Generation

Peter Huang
Dear All,

Happy new year! According to existing feedback from the community, we
revised the doc with the consideration of session cluster support, and
concrete interface changes needed and execution plan. Please take one more
round of review at your most convenient time.

https://docs.google.com/document/d/1aAwVjdZByA-0CHbgv16Me-vjaaDMCfhX7TzVVTuifYM/edit#


Best Regards
Peter Huang





On Thu, Jan 2, 2020 at 11:29 AM Peter Huang <[hidden email]>
wrote:

> Hi Dian,
> Thanks for giving us valuable feedbacks.
>
> 1) It's better to have a whole design for this feature
> For the suggestion of enabling the cluster mode also session cluster, I
> think Flink already supported it. WebSubmissionExtension already allows
> users to start a job with the specified jar by using web UI.
> But we need to enable the feature from CLI for both local jar, remote jar.
> I will align with Yang Wang first about the details and update the design
> doc.
>
> 2) It's better to consider the convenience for users, such as debugging
>
> I am wondering whether we can store the exception in jobgragh
> generation in application master. As no streaming graph can be scheduled in
> this case, there will be no more TM will be requested from FlinkRM.
> If the AM is still running, users can still query it from CLI. As it
> requires more change, we can get some feedback from <[hidden email]>
> and @[hidden email] <[hidden email]>.
>
> 3) It's better to consider the impact to the stability of the cluster
>
> I agree with Yang Wang's opinion.
>
>
>
> Best Regards
> Peter Huang
>
>
> On Sun, Dec 29, 2019 at 9:44 PM Dian Fu <[hidden email]> wrote:
>
>> Hi all,
>>
>> Sorry to jump into this discussion. Thanks everyone for the discussion.
>> I'm very interested in this topic although I'm not an expert in this part.
>> So I'm glad to share my thoughts as following:
>>
>> 1) It's better to have a whole design for this feature
>> As we know, there are two deployment modes: per-job mode and session
>> mode. I'm wondering which mode really needs this feature. As the design doc
>> mentioned, per-job mode is more used for streaming jobs and session mode is
>> usually used for batch jobs(Of course, the job types and the deployment
>> modes are orthogonal). Usually streaming job is only needed to be submitted
>> once and it will run for days or weeks, while batch jobs will be submitted
>> more frequently compared with streaming jobs. This means that maybe session
>> mode also needs this feature. However, if we support this feature in
>> session mode, the application master will become the new centralized
>> service(which should be solved). So in this case, it's better to have a
>> complete design for both per-job mode and session mode. Furthermore, even
>> if we can do it phase by phase, we need to have a whole picture of how it
>> works in both per-job mode and session mode.
>>
>> 2) It's better to consider the convenience for users, such as debugging
>> After we finish this feature, the job graph will be compiled in the
>> application master, which means that users cannot easily get the exception
>> message synchorousely in the job client if there are problems during the
>> job graph compiling (especially for platform users), such as the resource
>> path is incorrect, the user program itself has some problems, etc. What I'm
>> thinking is that maybe we should throw the exceptions as early as possible
>> (during job submission stage).
>>
>> 3) It's better to consider the impact to the stability of the cluster
>> If we perform the compiling in the application master, we should consider
>> the impact of the compiling errors. Although YARN could resume the
>> application master in case of failures, but in some case the compiling
>> failure may be a waste of cluster resource and may impact the stability the
>> cluster and the other jobs in the cluster, such as the resource path is
>> incorrect, the user program itself has some problems(in this case, job
>> failover cannot solve this kind of problems) etc. In the current
>> implemention, the compiling errors are handled in the client side and there
>> is no impact to the cluster at all.
>>
>> Regarding to 1), it's clearly pointed in the design doc that only per-job
>> mode will be supported. However, I think it's better to also consider the
>> session mode in the design doc.
>> Regarding to 2) and 3), I have not seen related sections in the design
>> doc. It will be good if we can cover them in the design doc.
>>
>> Feel free to correct me If there is anything I misunderstand.
>>
>> Regards,
>> Dian
>>
>>
>> > 在 2019年12月27日,上午3:13,Peter Huang <[hidden email]> 写道:
>> >
>> > Hi Yang,
>> >
>> > I can't agree more. The effort definitely needs to align with the final
>> > goal of FLIP-73.
>> > I am thinking about whether we can achieve the goal with two phases.
>> >
>> > 1) Phase I
>> > As the CLiFrontend will not be depreciated soon. We can still use the
>> > deployMode flag there,
>> > pass the program info through Flink configuration,  use the
>> > ClassPathJobGraphRetriever
>> > to generate the job graph in ClusterEntrypoints of yarn and Kubernetes.
>> >
>> > 2) Phase II
>> > In  AbstractJobClusterExecutor, the job graph is generated in the
>> execute
>> > function. We can still
>> > use the deployMode in it. With deployMode = cluster, the execute
>> function
>> > only starts the cluster.
>> >
>> > When {Yarn/Kuberneates}PerJobClusterEntrypoint starts, It will start the
>> > dispatch first, then we can use
>> > a ClusterEnvironment similar to ContextEnvironment to submit the job
>> with
>> > jobName the local
>> > dispatcher. For the details, we need more investigation. Let's wait
>> > for @Aljoscha
>> > Krettek <[hidden email]> @Till Rohrmann <[hidden email]>'s
>> > feedback after the holiday season.
>> >
>> > Thank you in advance. Merry Chrismas and Happy New Year!!!
>> >
>> >
>> > Best Regards
>> > Peter Huang
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > On Wed, Dec 25, 2019 at 1:08 AM Yang Wang <[hidden email]>
>> wrote:
>> >
>> >> Hi Peter,
>> >>
>> >> I think we need to reconsider tison's suggestion seriously. After
>> FLIP-73,
>> >> the deployJobCluster has
>> >> beenmoved into `JobClusterExecutor#execute`. It should not be perceived
>> >> for `CliFrontend`. That
>> >> means the user program will *ALWAYS* be executed on client side. This
>> is
>> >> the by design behavior.
>> >> So, we could not just add `if(client mode) .. else if(cluster mode)
>> ...`
>> >> codes in `CliFrontend` to bypass
>> >> the executor. We need to find a clean way to decouple executing user
>> >> program and deploying per-job
>> >> cluster. Based on this, we could support to execute user program on
>> client
>> >> or master side.
>> >>
>> >> Maybe Aljoscha and Jeff could give some good suggestions.
>> >>
>> >>
>> >>
>> >> Best,
>> >> Yang
>> >>
>> >> Peter Huang <[hidden email]> 于2019年12月25日周三 上午4:03写道:
>> >>
>> >>> Hi Jingjing,
>> >>>
>> >>> The improvement proposed is a deployment option for CLI. For SQL based
>> >>> Flink application, It is more convenient to use the existing model in
>> >>> SqlClient in which
>> >>> the job graph is generated within SqlClient. After adding the delayed
>> job
>> >>> graph generation, I think there is no change is needed for your side.
>> >>>
>> >>>
>> >>> Best Regards
>> >>> Peter Huang
>> >>>
>> >>>
>> >>> On Wed, Dec 18, 2019 at 6:01 AM jingjing bai <
>> [hidden email]>
>> >>> wrote:
>> >>>
>> >>>> hi peter:
>> >>>>    we had extension SqlClent to support sql job submit in web base on
>> >>>> flink 1.9.   we support submit to yarn on per job mode too.
>> >>>>    in this case, the job graph generated  on client side .  I think
>> >>> this
>> >>>> discuss Mainly to improve api programme.  but in my case , there is
>> no
>> >>>> jar to upload but only a sql string .
>> >>>>    do u had more suggestion to improve for sql mode or it is only a
>> >>>> switch for api programme?
>> >>>>
>> >>>>
>> >>>> best
>> >>>> bai jj
>> >>>>
>> >>>>
>> >>>> Yang Wang <[hidden email]> 于2019年12月18日周三 下午7:21写道:
>> >>>>
>> >>>>> I just want to revive this discussion.
>> >>>>>
>> >>>>> Recently, i am thinking about how to natively run flink per-job
>> >>> cluster on
>> >>>>> Kubernetes.
>> >>>>> The per-job mode on Kubernetes is very different from on Yarn. And
>> we
>> >>> will
>> >>>>> have
>> >>>>> the same deployment requirements to the client and entry point.
>> >>>>>
>> >>>>> 1. Flink client not always need a local jar to start a Flink per-job
>> >>>>> cluster. We could
>> >>>>> support multiple schemas. For example, file:///path/of/my.jar means
>> a
>> >>> jar
>> >>>>> located
>> >>>>> at client side, hdfs://myhdfs/user/myname/flink/my.jar means a jar
>> >>> located
>> >>>>> at
>> >>>>> remote hdfs, local:///path/in/image/my.jar means a jar located at
>> >>>>> jobmanager side.
>> >>>>>
>> >>>>> 2. Support running user program on master side. This also means the
>> >>> entry
>> >>>>> point
>> >>>>> will generate the job graph on master side. We could use the
>> >>>>> ClasspathJobGraphRetriever
>> >>>>> or start a local Flink client to achieve this purpose.
>> >>>>>
>> >>>>>
>> >>>>> cc tison, Aljoscha & Kostas Do you think this is the right
>> direction we
>> >>>>> need to work?
>> >>>>>
>> >>>>> tison <[hidden email]> 于2019年12月12日周四 下午4:48写道:
>> >>>>>
>> >>>>>> A quick idea is that we separate the deployment from user program
>> >>> that
>> >>>>> it
>> >>>>>> has always been done
>> >>>>>> outside the program. On user program executed there is always a
>> >>>>>> ClusterClient that communicates with
>> >>>>>> an existing cluster, remote or local. It will be another thread so
>> >>> just
>> >>>>> for
>> >>>>>> your information.
>> >>>>>>
>> >>>>>> Best,
>> >>>>>> tison.
>> >>>>>>
>> >>>>>>
>> >>>>>> tison <[hidden email]> 于2019年12月12日周四 下午4:40写道:
>> >>>>>>
>> >>>>>>> Hi Peter,
>> >>>>>>>
>> >>>>>>> Another concern I realized recently is that with current Executors
>> >>>>>>> abstraction(FLIP-73)
>> >>>>>>> I'm afraid that user program is designed to ALWAYS run on the
>> >>> client
>> >>>>>> side.
>> >>>>>>> Specifically,
>> >>>>>>> we deploy the job in executor when env.execute called. This
>> >>>>> abstraction
>> >>>>>>> possibly prevents
>> >>>>>>> Flink runs user program on the cluster side.
>> >>>>>>>
>> >>>>>>> For your proposal, in this case we already compiled the program
>> and
>> >>>>> run
>> >>>>>> on
>> >>>>>>> the client side,
>> >>>>>>> even we deploy a cluster and retrieve job graph from program
>> >>>>> metadata, it
>> >>>>>>> doesn't make
>> >>>>>>> many sense.
>> >>>>>>>
>> >>>>>>> cc Aljoscha & Kostas what do you think about this constraint?
>> >>>>>>>
>> >>>>>>> Best,
>> >>>>>>> tison.
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> Peter Huang <[hidden email]> 于2019年12月10日周二
>> 下午12:45写道:
>> >>>>>>>
>> >>>>>>>> Hi Tison,
>> >>>>>>>>
>> >>>>>>>> Yes, you are right. I think I made the wrong argument in the doc.
>> >>>>>>>> Basically, the packaging jar problem is only for platform users.
>> >>> In
>> >>>>> our
>> >>>>>>>> internal deploy service,
>> >>>>>>>> we further optimized the deployment latency by letting users to
>> >>>>>> packaging
>> >>>>>>>> flink-runtime together with the uber jar, so that we don't need
>> to
>> >>>>>>>> consider
>> >>>>>>>> multiple flink version
>> >>>>>>>> support for now. In the session client mode, as Flink libs will
>> be
>> >>>>>> shipped
>> >>>>>>>> anyway as local resources of yarn. Users actually don't need to
>> >>>>> package
>> >>>>>>>> those libs into job jar.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> Best Regards
>> >>>>>>>> Peter Huang
>> >>>>>>>>
>> >>>>>>>> On Mon, Dec 9, 2019 at 8:35 PM tison <[hidden email]>
>> >>> wrote:
>> >>>>>>>>
>> >>>>>>>>>> 3. What do you mean about the package? Do users need to
>> >>> compile
>> >>>>>> their
>> >>>>>>>>> jars
>> >>>>>>>>> inlcuding flink-clients, flink-optimizer, flink-table codes?
>> >>>>>>>>>
>> >>>>>>>>> The answer should be no because they exist in system classpath.
>> >>>>>>>>>
>> >>>>>>>>> Best,
>> >>>>>>>>> tison.
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> Yang Wang <[hidden email]> 于2019年12月10日周二 下午12:18写道:
>> >>>>>>>>>
>> >>>>>>>>>> Hi Peter,
>> >>>>>>>>>>
>> >>>>>>>>>> Thanks a lot for starting this discussion. I think this is a
>> >>> very
>> >>>>>>>> useful
>> >>>>>>>>>> feature.
>> >>>>>>>>>>
>> >>>>>>>>>> Not only for Yarn, i am focused on flink on Kubernetes
>> >>>>> integration
>> >>>>>> and
>> >>>>>>>>> come
>> >>>>>>>>>> across the same
>> >>>>>>>>>> problem. I do not want the job graph generated on client side.
>> >>>>>>>> Instead,
>> >>>>>>>>> the
>> >>>>>>>>>> user jars are built in
>> >>>>>>>>>> a user-defined image. When the job manager launched, we just
>> >>>>> need to
>> >>>>>>>>>> generate the job graph
>> >>>>>>>>>> based on local user jars.
>> >>>>>>>>>>
>> >>>>>>>>>> I have some small suggestion about this.
>> >>>>>>>>>>
>> >>>>>>>>>> 1. `ProgramJobGraphRetriever` is very similar to
>> >>>>>>>>>> `ClasspathJobGraphRetriever`, the differences
>> >>>>>>>>>> are the former needs `ProgramMetadata` and the latter needs
>> >>> some
>> >>>>>>>>> arguments.
>> >>>>>>>>>> Is it possible to
>> >>>>>>>>>> have an unified `JobGraphRetriever` to support both?
>> >>>>>>>>>> 2. Is it possible to not use a local user jar to start a
>> >>> per-job
>> >>>>>>>> cluster?
>> >>>>>>>>>> In your case, the user jars has
>> >>>>>>>>>> existed on hdfs already and we do need to download the jars to
>> >>>>>>>> deployer
>> >>>>>>>>>> service. Currently, we
>> >>>>>>>>>> always need a local user jar to start a flink cluster. It is
>> >>> be
>> >>>>>> great
>> >>>>>>>> if
>> >>>>>>>>> we
>> >>>>>>>>>> could support remote user jars.
>> >>>>>>>>>>>> In the implementation, we assume users package
>> >>> flink-clients,
>> >>>>>>>>>> flink-optimizer, flink-table together within the job jar.
>> >>>>> Otherwise,
>> >>>>>>>> the
>> >>>>>>>>>> job graph generation within JobClusterEntryPoint will fail.
>> >>>>>>>>>> 3. What do you mean about the package? Do users need to
>> >>> compile
>> >>>>>> their
>> >>>>>>>>> jars
>> >>>>>>>>>> inlcuding flink-clients, flink-optimizer, flink-table codes?
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> Best,
>> >>>>>>>>>> Yang
>> >>>>>>>>>>
>> >>>>>>>>>> Peter Huang <[hidden email]> 于2019年12月10日周二
>> >>>>> 上午2:37写道:
>> >>>>>>>>>>
>> >>>>>>>>>>> Dear All,
>> >>>>>>>>>>>
>> >>>>>>>>>>> Recently, the Flink community starts to improve the yarn
>> >>>>> cluster
>> >>>>>>>>>> descriptor
>> >>>>>>>>>>> to make job jar and config files configurable from CLI. It
>> >>>>>> improves
>> >>>>>>>> the
>> >>>>>>>>>>> flexibility of  Flink deployment Yarn Per Job Mode. For
>> >>>>> platform
>> >>>>>>>> users
>> >>>>>>>>>> who
>> >>>>>>>>>>> manage tens of hundreds of streaming pipelines for the whole
>> >>>>> org
>> >>>>>> or
>> >>>>>>>>>>> company, we found the job graph generation in client-side is
>> >>>>>> another
>> >>>>>>>>>>> pinpoint. Thus, we want to propose a configurable feature
>> >>> for
>> >>>>>>>>>>> FlinkYarnSessionCli. The feature can allow users to choose
>> >>> the
>> >>>>> job
>> >>>>>>>>> graph
>> >>>>>>>>>>> generation in Flink ClusterEntryPoint so that the job jar
>> >>>>> doesn't
>> >>>>>>>> need
>> >>>>>>>>> to
>> >>>>>>>>>>> be locally for the job graph generation. The proposal is
>> >>>>> organized
>> >>>>>>>> as a
>> >>>>>>>>>>> FLIP
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>
>> >>>>>
>> >>>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-85+Delayed+JobGraph+Generation
>> >>>>>>>>>>> .
>> >>>>>>>>>>>
>> >>>>>>>>>>> Any questions and suggestions are welcomed. Thank you in
>> >>>>> advance.
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> Best Regards
>> >>>>>>>>>>> Peter Huang
>> >>>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>>
>> >>>>>>
>> >>>>>
>> >>>>
>> >>>
>> >>
>>
>>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-85: Delayed Job Graph Generation

tison
Hi Peter,

As described above, this effort should get attention from people developing
FLIP-73 a.k.a. Executor abstractions. I recommend you to join the public
slack channel[1] for Flink Client API Enhancement and you can try to share
you detailed thoughts there. It possibly gets more concrete attentions.

Best,
tison.

[1]
https://slack.com/share/IS21SJ75H/Rk8HhUly9FuEHb7oGwBZ33uL/enQtODg2MDYwNjE5MTg3LTA2MjIzNDc1M2ZjZDVlMjdlZjk1M2RkYmJhNjAwMTk2ZDZkODQ4NmY5YmI4OGRhNWJkYTViMTM1NzlmMzc4OWM


Peter Huang <[hidden email]> 于2020年1月7日周二 上午5:09写道:

> Dear All,
>
> Happy new year! According to existing feedback from the community, we
> revised the doc with the consideration of session cluster support, and
> concrete interface changes needed and execution plan. Please take one more
> round of review at your most convenient time.
>
>
> https://docs.google.com/document/d/1aAwVjdZByA-0CHbgv16Me-vjaaDMCfhX7TzVVTuifYM/edit#
>
>
> Best Regards
> Peter Huang
>
>
>
>
>
> On Thu, Jan 2, 2020 at 11:29 AM Peter Huang <[hidden email]>
> wrote:
>
> > Hi Dian,
> > Thanks for giving us valuable feedbacks.
> >
> > 1) It's better to have a whole design for this feature
> > For the suggestion of enabling the cluster mode also session cluster, I
> > think Flink already supported it. WebSubmissionExtension already allows
> > users to start a job with the specified jar by using web UI.
> > But we need to enable the feature from CLI for both local jar, remote
> jar.
> > I will align with Yang Wang first about the details and update the design
> > doc.
> >
> > 2) It's better to consider the convenience for users, such as debugging
> >
> > I am wondering whether we can store the exception in jobgragh
> > generation in application master. As no streaming graph can be scheduled
> in
> > this case, there will be no more TM will be requested from FlinkRM.
> > If the AM is still running, users can still query it from CLI. As it
> > requires more change, we can get some feedback from <[hidden email]
> >
> > and @[hidden email] <[hidden email]>.
> >
> > 3) It's better to consider the impact to the stability of the cluster
> >
> > I agree with Yang Wang's opinion.
> >
> >
> >
> > Best Regards
> > Peter Huang
> >
> >
> > On Sun, Dec 29, 2019 at 9:44 PM Dian Fu <[hidden email]> wrote:
> >
> >> Hi all,
> >>
> >> Sorry to jump into this discussion. Thanks everyone for the discussion.
> >> I'm very interested in this topic although I'm not an expert in this
> part.
> >> So I'm glad to share my thoughts as following:
> >>
> >> 1) It's better to have a whole design for this feature
> >> As we know, there are two deployment modes: per-job mode and session
> >> mode. I'm wondering which mode really needs this feature. As the design
> doc
> >> mentioned, per-job mode is more used for streaming jobs and session
> mode is
> >> usually used for batch jobs(Of course, the job types and the deployment
> >> modes are orthogonal). Usually streaming job is only needed to be
> submitted
> >> once and it will run for days or weeks, while batch jobs will be
> submitted
> >> more frequently compared with streaming jobs. This means that maybe
> session
> >> mode also needs this feature. However, if we support this feature in
> >> session mode, the application master will become the new centralized
> >> service(which should be solved). So in this case, it's better to have a
> >> complete design for both per-job mode and session mode. Furthermore,
> even
> >> if we can do it phase by phase, we need to have a whole picture of how
> it
> >> works in both per-job mode and session mode.
> >>
> >> 2) It's better to consider the convenience for users, such as debugging
> >> After we finish this feature, the job graph will be compiled in the
> >> application master, which means that users cannot easily get the
> exception
> >> message synchorousely in the job client if there are problems during the
> >> job graph compiling (especially for platform users), such as the
> resource
> >> path is incorrect, the user program itself has some problems, etc. What
> I'm
> >> thinking is that maybe we should throw the exceptions as early as
> possible
> >> (during job submission stage).
> >>
> >> 3) It's better to consider the impact to the stability of the cluster
> >> If we perform the compiling in the application master, we should
> consider
> >> the impact of the compiling errors. Although YARN could resume the
> >> application master in case of failures, but in some case the compiling
> >> failure may be a waste of cluster resource and may impact the stability
> the
> >> cluster and the other jobs in the cluster, such as the resource path is
> >> incorrect, the user program itself has some problems(in this case, job
> >> failover cannot solve this kind of problems) etc. In the current
> >> implemention, the compiling errors are handled in the client side and
> there
> >> is no impact to the cluster at all.
> >>
> >> Regarding to 1), it's clearly pointed in the design doc that only
> per-job
> >> mode will be supported. However, I think it's better to also consider
> the
> >> session mode in the design doc.
> >> Regarding to 2) and 3), I have not seen related sections in the design
> >> doc. It will be good if we can cover them in the design doc.
> >>
> >> Feel free to correct me If there is anything I misunderstand.
> >>
> >> Regards,
> >> Dian
> >>
> >>
> >> > 在 2019年12月27日,上午3:13,Peter Huang <[hidden email]> 写道:
> >> >
> >> > Hi Yang,
> >> >
> >> > I can't agree more. The effort definitely needs to align with the
> final
> >> > goal of FLIP-73.
> >> > I am thinking about whether we can achieve the goal with two phases.
> >> >
> >> > 1) Phase I
> >> > As the CLiFrontend will not be depreciated soon. We can still use the
> >> > deployMode flag there,
> >> > pass the program info through Flink configuration,  use the
> >> > ClassPathJobGraphRetriever
> >> > to generate the job graph in ClusterEntrypoints of yarn and
> Kubernetes.
> >> >
> >> > 2) Phase II
> >> > In  AbstractJobClusterExecutor, the job graph is generated in the
> >> execute
> >> > function. We can still
> >> > use the deployMode in it. With deployMode = cluster, the execute
> >> function
> >> > only starts the cluster.
> >> >
> >> > When {Yarn/Kuberneates}PerJobClusterEntrypoint starts, It will start
> the
> >> > dispatch first, then we can use
> >> > a ClusterEnvironment similar to ContextEnvironment to submit the job
> >> with
> >> > jobName the local
> >> > dispatcher. For the details, we need more investigation. Let's wait
> >> > for @Aljoscha
> >> > Krettek <[hidden email]> @Till Rohrmann <[hidden email]>'s
> >> > feedback after the holiday season.
> >> >
> >> > Thank you in advance. Merry Chrismas and Happy New Year!!!
> >> >
> >> >
> >> > Best Regards
> >> > Peter Huang
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > On Wed, Dec 25, 2019 at 1:08 AM Yang Wang <[hidden email]>
> >> wrote:
> >> >
> >> >> Hi Peter,
> >> >>
> >> >> I think we need to reconsider tison's suggestion seriously. After
> >> FLIP-73,
> >> >> the deployJobCluster has
> >> >> beenmoved into `JobClusterExecutor#execute`. It should not be
> perceived
> >> >> for `CliFrontend`. That
> >> >> means the user program will *ALWAYS* be executed on client side. This
> >> is
> >> >> the by design behavior.
> >> >> So, we could not just add `if(client mode) .. else if(cluster mode)
> >> ...`
> >> >> codes in `CliFrontend` to bypass
> >> >> the executor. We need to find a clean way to decouple executing user
> >> >> program and deploying per-job
> >> >> cluster. Based on this, we could support to execute user program on
> >> client
> >> >> or master side.
> >> >>
> >> >> Maybe Aljoscha and Jeff could give some good suggestions.
> >> >>
> >> >>
> >> >>
> >> >> Best,
> >> >> Yang
> >> >>
> >> >> Peter Huang <[hidden email]> 于2019年12月25日周三 上午4:03写道:
> >> >>
> >> >>> Hi Jingjing,
> >> >>>
> >> >>> The improvement proposed is a deployment option for CLI. For SQL
> based
> >> >>> Flink application, It is more convenient to use the existing model
> in
> >> >>> SqlClient in which
> >> >>> the job graph is generated within SqlClient. After adding the
> delayed
> >> job
> >> >>> graph generation, I think there is no change is needed for your
> side.
> >> >>>
> >> >>>
> >> >>> Best Regards
> >> >>> Peter Huang
> >> >>>
> >> >>>
> >> >>> On Wed, Dec 18, 2019 at 6:01 AM jingjing bai <
> >> [hidden email]>
> >> >>> wrote:
> >> >>>
> >> >>>> hi peter:
> >> >>>>    we had extension SqlClent to support sql job submit in web base
> on
> >> >>>> flink 1.9.   we support submit to yarn on per job mode too.
> >> >>>>    in this case, the job graph generated  on client side .  I think
> >> >>> this
> >> >>>> discuss Mainly to improve api programme.  but in my case , there is
> >> no
> >> >>>> jar to upload but only a sql string .
> >> >>>>    do u had more suggestion to improve for sql mode or it is only a
> >> >>>> switch for api programme?
> >> >>>>
> >> >>>>
> >> >>>> best
> >> >>>> bai jj
> >> >>>>
> >> >>>>
> >> >>>> Yang Wang <[hidden email]> 于2019年12月18日周三 下午7:21写道:
> >> >>>>
> >> >>>>> I just want to revive this discussion.
> >> >>>>>
> >> >>>>> Recently, i am thinking about how to natively run flink per-job
> >> >>> cluster on
> >> >>>>> Kubernetes.
> >> >>>>> The per-job mode on Kubernetes is very different from on Yarn. And
> >> we
> >> >>> will
> >> >>>>> have
> >> >>>>> the same deployment requirements to the client and entry point.
> >> >>>>>
> >> >>>>> 1. Flink client not always need a local jar to start a Flink
> per-job
> >> >>>>> cluster. We could
> >> >>>>> support multiple schemas. For example, file:///path/of/my.jar
> means
> >> a
> >> >>> jar
> >> >>>>> located
> >> >>>>> at client side, hdfs://myhdfs/user/myname/flink/my.jar means a jar
> >> >>> located
> >> >>>>> at
> >> >>>>> remote hdfs, local:///path/in/image/my.jar means a jar located at
> >> >>>>> jobmanager side.
> >> >>>>>
> >> >>>>> 2. Support running user program on master side. This also means
> the
> >> >>> entry
> >> >>>>> point
> >> >>>>> will generate the job graph on master side. We could use the
> >> >>>>> ClasspathJobGraphRetriever
> >> >>>>> or start a local Flink client to achieve this purpose.
> >> >>>>>
> >> >>>>>
> >> >>>>> cc tison, Aljoscha & Kostas Do you think this is the right
> >> direction we
> >> >>>>> need to work?
> >> >>>>>
> >> >>>>> tison <[hidden email]> 于2019年12月12日周四 下午4:48写道:
> >> >>>>>
> >> >>>>>> A quick idea is that we separate the deployment from user program
> >> >>> that
> >> >>>>> it
> >> >>>>>> has always been done
> >> >>>>>> outside the program. On user program executed there is always a
> >> >>>>>> ClusterClient that communicates with
> >> >>>>>> an existing cluster, remote or local. It will be another thread
> so
> >> >>> just
> >> >>>>> for
> >> >>>>>> your information.
> >> >>>>>>
> >> >>>>>> Best,
> >> >>>>>> tison.
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> tison <[hidden email]> 于2019年12月12日周四 下午4:40写道:
> >> >>>>>>
> >> >>>>>>> Hi Peter,
> >> >>>>>>>
> >> >>>>>>> Another concern I realized recently is that with current
> Executors
> >> >>>>>>> abstraction(FLIP-73)
> >> >>>>>>> I'm afraid that user program is designed to ALWAYS run on the
> >> >>> client
> >> >>>>>> side.
> >> >>>>>>> Specifically,
> >> >>>>>>> we deploy the job in executor when env.execute called. This
> >> >>>>> abstraction
> >> >>>>>>> possibly prevents
> >> >>>>>>> Flink runs user program on the cluster side.
> >> >>>>>>>
> >> >>>>>>> For your proposal, in this case we already compiled the program
> >> and
> >> >>>>> run
> >> >>>>>> on
> >> >>>>>>> the client side,
> >> >>>>>>> even we deploy a cluster and retrieve job graph from program
> >> >>>>> metadata, it
> >> >>>>>>> doesn't make
> >> >>>>>>> many sense.
> >> >>>>>>>
> >> >>>>>>> cc Aljoscha & Kostas what do you think about this constraint?
> >> >>>>>>>
> >> >>>>>>> Best,
> >> >>>>>>> tison.
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>> Peter Huang <[hidden email]> 于2019年12月10日周二
> >> 下午12:45写道:
> >> >>>>>>>
> >> >>>>>>>> Hi Tison,
> >> >>>>>>>>
> >> >>>>>>>> Yes, you are right. I think I made the wrong argument in the
> doc.
> >> >>>>>>>> Basically, the packaging jar problem is only for platform
> users.
> >> >>> In
> >> >>>>> our
> >> >>>>>>>> internal deploy service,
> >> >>>>>>>> we further optimized the deployment latency by letting users to
> >> >>>>>> packaging
> >> >>>>>>>> flink-runtime together with the uber jar, so that we don't need
> >> to
> >> >>>>>>>> consider
> >> >>>>>>>> multiple flink version
> >> >>>>>>>> support for now. In the session client mode, as Flink libs will
> >> be
> >> >>>>>> shipped
> >> >>>>>>>> anyway as local resources of yarn. Users actually don't need to
> >> >>>>> package
> >> >>>>>>>> those libs into job jar.
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>> Best Regards
> >> >>>>>>>> Peter Huang
> >> >>>>>>>>
> >> >>>>>>>> On Mon, Dec 9, 2019 at 8:35 PM tison <[hidden email]>
> >> >>> wrote:
> >> >>>>>>>>
> >> >>>>>>>>>> 3. What do you mean about the package? Do users need to
> >> >>> compile
> >> >>>>>> their
> >> >>>>>>>>> jars
> >> >>>>>>>>> inlcuding flink-clients, flink-optimizer, flink-table codes?
> >> >>>>>>>>>
> >> >>>>>>>>> The answer should be no because they exist in system
> classpath.
> >> >>>>>>>>>
> >> >>>>>>>>> Best,
> >> >>>>>>>>> tison.
> >> >>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>> Yang Wang <[hidden email]> 于2019年12月10日周二 下午12:18写道:
> >> >>>>>>>>>
> >> >>>>>>>>>> Hi Peter,
> >> >>>>>>>>>>
> >> >>>>>>>>>> Thanks a lot for starting this discussion. I think this is a
> >> >>> very
> >> >>>>>>>> useful
> >> >>>>>>>>>> feature.
> >> >>>>>>>>>>
> >> >>>>>>>>>> Not only for Yarn, i am focused on flink on Kubernetes
> >> >>>>> integration
> >> >>>>>> and
> >> >>>>>>>>> come
> >> >>>>>>>>>> across the same
> >> >>>>>>>>>> problem. I do not want the job graph generated on client
> side.
> >> >>>>>>>> Instead,
> >> >>>>>>>>> the
> >> >>>>>>>>>> user jars are built in
> >> >>>>>>>>>> a user-defined image. When the job manager launched, we just
> >> >>>>> need to
> >> >>>>>>>>>> generate the job graph
> >> >>>>>>>>>> based on local user jars.
> >> >>>>>>>>>>
> >> >>>>>>>>>> I have some small suggestion about this.
> >> >>>>>>>>>>
> >> >>>>>>>>>> 1. `ProgramJobGraphRetriever` is very similar to
> >> >>>>>>>>>> `ClasspathJobGraphRetriever`, the differences
> >> >>>>>>>>>> are the former needs `ProgramMetadata` and the latter needs
> >> >>> some
> >> >>>>>>>>> arguments.
> >> >>>>>>>>>> Is it possible to
> >> >>>>>>>>>> have an unified `JobGraphRetriever` to support both?
> >> >>>>>>>>>> 2. Is it possible to not use a local user jar to start a
> >> >>> per-job
> >> >>>>>>>> cluster?
> >> >>>>>>>>>> In your case, the user jars has
> >> >>>>>>>>>> existed on hdfs already and we do need to download the jars
> to
> >> >>>>>>>> deployer
> >> >>>>>>>>>> service. Currently, we
> >> >>>>>>>>>> always need a local user jar to start a flink cluster. It is
> >> >>> be
> >> >>>>>> great
> >> >>>>>>>> if
> >> >>>>>>>>> we
> >> >>>>>>>>>> could support remote user jars.
> >> >>>>>>>>>>>> In the implementation, we assume users package
> >> >>> flink-clients,
> >> >>>>>>>>>> flink-optimizer, flink-table together within the job jar.
> >> >>>>> Otherwise,
> >> >>>>>>>> the
> >> >>>>>>>>>> job graph generation within JobClusterEntryPoint will fail.
> >> >>>>>>>>>> 3. What do you mean about the package? Do users need to
> >> >>> compile
> >> >>>>>> their
> >> >>>>>>>>> jars
> >> >>>>>>>>>> inlcuding flink-clients, flink-optimizer, flink-table codes?
> >> >>>>>>>>>>
> >> >>>>>>>>>>
> >> >>>>>>>>>>
> >> >>>>>>>>>> Best,
> >> >>>>>>>>>> Yang
> >> >>>>>>>>>>
> >> >>>>>>>>>> Peter Huang <[hidden email]> 于2019年12月10日周二
> >> >>>>> 上午2:37写道:
> >> >>>>>>>>>>
> >> >>>>>>>>>>> Dear All,
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> Recently, the Flink community starts to improve the yarn
> >> >>>>> cluster
> >> >>>>>>>>>> descriptor
> >> >>>>>>>>>>> to make job jar and config files configurable from CLI. It
> >> >>>>>> improves
> >> >>>>>>>> the
> >> >>>>>>>>>>> flexibility of  Flink deployment Yarn Per Job Mode. For
> >> >>>>> platform
> >> >>>>>>>> users
> >> >>>>>>>>>> who
> >> >>>>>>>>>>> manage tens of hundreds of streaming pipelines for the whole
> >> >>>>> org
> >> >>>>>> or
> >> >>>>>>>>>>> company, we found the job graph generation in client-side is
> >> >>>>>> another
> >> >>>>>>>>>>> pinpoint. Thus, we want to propose a configurable feature
> >> >>> for
> >> >>>>>>>>>>> FlinkYarnSessionCli. The feature can allow users to choose
> >> >>> the
> >> >>>>> job
> >> >>>>>>>>> graph
> >> >>>>>>>>>>> generation in Flink ClusterEntryPoint so that the job jar
> >> >>>>> doesn't
> >> >>>>>>>> need
> >> >>>>>>>>> to
> >> >>>>>>>>>>> be locally for the job graph generation. The proposal is
> >> >>>>> organized
> >> >>>>>>>> as a
> >> >>>>>>>>>>> FLIP
> >> >>>>>>>>>>>
> >> >>>>>>>>>>>
> >> >>>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>
> >> >>>>>>
> >> >>>>>
> >> >>>
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-85+Delayed+JobGraph+Generation
> >> >>>>>>>>>>> .
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> Any questions and suggestions are welcomed. Thank you in
> >> >>>>> advance.
> >> >>>>>>>>>>>
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> Best Regards
> >> >>>>>>>>>>> Peter Huang
> >> >>>>>>>>>>>
> >> >>>>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>
> >> >>>>>>
> >> >>>>>
> >> >>>>
> >> >>>
> >> >>
> >>
> >>
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-85: Delayed Job Graph Generation

Peter Huang
Hi Tison,

I can't join the group with shared link. Would you please add me into the
group? My slack account is huangzhenqiu0825.
Thank you in advance.


Best Regards
Peter Huang

On Wed, Jan 8, 2020 at 12:02 AM tison <[hidden email]> wrote:

> Hi Peter,
>
> As described above, this effort should get attention from people developing
> FLIP-73 a.k.a. Executor abstractions. I recommend you to join the public
> slack channel[1] for Flink Client API Enhancement and you can try to share
> you detailed thoughts there. It possibly gets more concrete attentions.
>
> Best,
> tison.
>
> [1]
>
> https://slack.com/share/IS21SJ75H/Rk8HhUly9FuEHb7oGwBZ33uL/enQtODg2MDYwNjE5MTg3LTA2MjIzNDc1M2ZjZDVlMjdlZjk1M2RkYmJhNjAwMTk2ZDZkODQ4NmY5YmI4OGRhNWJkYTViMTM1NzlmMzc4OWM
>
>
> Peter Huang <[hidden email]> 于2020年1月7日周二 上午5:09写道:
>
> > Dear All,
> >
> > Happy new year! According to existing feedback from the community, we
> > revised the doc with the consideration of session cluster support, and
> > concrete interface changes needed and execution plan. Please take one
> more
> > round of review at your most convenient time.
> >
> >
> >
> https://docs.google.com/document/d/1aAwVjdZByA-0CHbgv16Me-vjaaDMCfhX7TzVVTuifYM/edit#
> >
> >
> > Best Regards
> > Peter Huang
> >
> >
> >
> >
> >
> > On Thu, Jan 2, 2020 at 11:29 AM Peter Huang <[hidden email]>
> > wrote:
> >
> > > Hi Dian,
> > > Thanks for giving us valuable feedbacks.
> > >
> > > 1) It's better to have a whole design for this feature
> > > For the suggestion of enabling the cluster mode also session cluster, I
> > > think Flink already supported it. WebSubmissionExtension already allows
> > > users to start a job with the specified jar by using web UI.
> > > But we need to enable the feature from CLI for both local jar, remote
> > jar.
> > > I will align with Yang Wang first about the details and update the
> design
> > > doc.
> > >
> > > 2) It's better to consider the convenience for users, such as debugging
> > >
> > > I am wondering whether we can store the exception in jobgragh
> > > generation in application master. As no streaming graph can be
> scheduled
> > in
> > > this case, there will be no more TM will be requested from FlinkRM.
> > > If the AM is still running, users can still query it from CLI. As it
> > > requires more change, we can get some feedback from <
> [hidden email]
> > >
> > > and @[hidden email] <[hidden email]>.
> > >
> > > 3) It's better to consider the impact to the stability of the cluster
> > >
> > > I agree with Yang Wang's opinion.
> > >
> > >
> > >
> > > Best Regards
> > > Peter Huang
> > >
> > >
> > > On Sun, Dec 29, 2019 at 9:44 PM Dian Fu <[hidden email]> wrote:
> > >
> > >> Hi all,
> > >>
> > >> Sorry to jump into this discussion. Thanks everyone for the
> discussion.
> > >> I'm very interested in this topic although I'm not an expert in this
> > part.
> > >> So I'm glad to share my thoughts as following:
> > >>
> > >> 1) It's better to have a whole design for this feature
> > >> As we know, there are two deployment modes: per-job mode and session
> > >> mode. I'm wondering which mode really needs this feature. As the
> design
> > doc
> > >> mentioned, per-job mode is more used for streaming jobs and session
> > mode is
> > >> usually used for batch jobs(Of course, the job types and the
> deployment
> > >> modes are orthogonal). Usually streaming job is only needed to be
> > submitted
> > >> once and it will run for days or weeks, while batch jobs will be
> > submitted
> > >> more frequently compared with streaming jobs. This means that maybe
> > session
> > >> mode also needs this feature. However, if we support this feature in
> > >> session mode, the application master will become the new centralized
> > >> service(which should be solved). So in this case, it's better to have
> a
> > >> complete design for both per-job mode and session mode. Furthermore,
> > even
> > >> if we can do it phase by phase, we need to have a whole picture of how
> > it
> > >> works in both per-job mode and session mode.
> > >>
> > >> 2) It's better to consider the convenience for users, such as
> debugging
> > >> After we finish this feature, the job graph will be compiled in the
> > >> application master, which means that users cannot easily get the
> > exception
> > >> message synchorousely in the job client if there are problems during
> the
> > >> job graph compiling (especially for platform users), such as the
> > resource
> > >> path is incorrect, the user program itself has some problems, etc.
> What
> > I'm
> > >> thinking is that maybe we should throw the exceptions as early as
> > possible
> > >> (during job submission stage).
> > >>
> > >> 3) It's better to consider the impact to the stability of the cluster
> > >> If we perform the compiling in the application master, we should
> > consider
> > >> the impact of the compiling errors. Although YARN could resume the
> > >> application master in case of failures, but in some case the compiling
> > >> failure may be a waste of cluster resource and may impact the
> stability
> > the
> > >> cluster and the other jobs in the cluster, such as the resource path
> is
> > >> incorrect, the user program itself has some problems(in this case, job
> > >> failover cannot solve this kind of problems) etc. In the current
> > >> implemention, the compiling errors are handled in the client side and
> > there
> > >> is no impact to the cluster at all.
> > >>
> > >> Regarding to 1), it's clearly pointed in the design doc that only
> > per-job
> > >> mode will be supported. However, I think it's better to also consider
> > the
> > >> session mode in the design doc.
> > >> Regarding to 2) and 3), I have not seen related sections in the design
> > >> doc. It will be good if we can cover them in the design doc.
> > >>
> > >> Feel free to correct me If there is anything I misunderstand.
> > >>
> > >> Regards,
> > >> Dian
> > >>
> > >>
> > >> > 在 2019年12月27日,上午3:13,Peter Huang <[hidden email]> 写道:
> > >> >
> > >> > Hi Yang,
> > >> >
> > >> > I can't agree more. The effort definitely needs to align with the
> > final
> > >> > goal of FLIP-73.
> > >> > I am thinking about whether we can achieve the goal with two phases.
> > >> >
> > >> > 1) Phase I
> > >> > As the CLiFrontend will not be depreciated soon. We can still use
> the
> > >> > deployMode flag there,
> > >> > pass the program info through Flink configuration,  use the
> > >> > ClassPathJobGraphRetriever
> > >> > to generate the job graph in ClusterEntrypoints of yarn and
> > Kubernetes.
> > >> >
> > >> > 2) Phase II
> > >> > In  AbstractJobClusterExecutor, the job graph is generated in the
> > >> execute
> > >> > function. We can still
> > >> > use the deployMode in it. With deployMode = cluster, the execute
> > >> function
> > >> > only starts the cluster.
> > >> >
> > >> > When {Yarn/Kuberneates}PerJobClusterEntrypoint starts, It will start
> > the
> > >> > dispatch first, then we can use
> > >> > a ClusterEnvironment similar to ContextEnvironment to submit the job
> > >> with
> > >> > jobName the local
> > >> > dispatcher. For the details, we need more investigation. Let's wait
> > >> > for @Aljoscha
> > >> > Krettek <[hidden email]> @Till Rohrmann <[hidden email]
> >'s
> > >> > feedback after the holiday season.
> > >> >
> > >> > Thank you in advance. Merry Chrismas and Happy New Year!!!
> > >> >
> > >> >
> > >> > Best Regards
> > >> > Peter Huang
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> > On Wed, Dec 25, 2019 at 1:08 AM Yang Wang <[hidden email]>
> > >> wrote:
> > >> >
> > >> >> Hi Peter,
> > >> >>
> > >> >> I think we need to reconsider tison's suggestion seriously. After
> > >> FLIP-73,
> > >> >> the deployJobCluster has
> > >> >> beenmoved into `JobClusterExecutor#execute`. It should not be
> > perceived
> > >> >> for `CliFrontend`. That
> > >> >> means the user program will *ALWAYS* be executed on client side.
> This
> > >> is
> > >> >> the by design behavior.
> > >> >> So, we could not just add `if(client mode) .. else if(cluster mode)
> > >> ...`
> > >> >> codes in `CliFrontend` to bypass
> > >> >> the executor. We need to find a clean way to decouple executing
> user
> > >> >> program and deploying per-job
> > >> >> cluster. Based on this, we could support to execute user program on
> > >> client
> > >> >> or master side.
> > >> >>
> > >> >> Maybe Aljoscha and Jeff could give some good suggestions.
> > >> >>
> > >> >>
> > >> >>
> > >> >> Best,
> > >> >> Yang
> > >> >>
> > >> >> Peter Huang <[hidden email]> 于2019年12月25日周三 上午4:03写道:
> > >> >>
> > >> >>> Hi Jingjing,
> > >> >>>
> > >> >>> The improvement proposed is a deployment option for CLI. For SQL
> > based
> > >> >>> Flink application, It is more convenient to use the existing model
> > in
> > >> >>> SqlClient in which
> > >> >>> the job graph is generated within SqlClient. After adding the
> > delayed
> > >> job
> > >> >>> graph generation, I think there is no change is needed for your
> > side.
> > >> >>>
> > >> >>>
> > >> >>> Best Regards
> > >> >>> Peter Huang
> > >> >>>
> > >> >>>
> > >> >>> On Wed, Dec 18, 2019 at 6:01 AM jingjing bai <
> > >> [hidden email]>
> > >> >>> wrote:
> > >> >>>
> > >> >>>> hi peter:
> > >> >>>>    we had extension SqlClent to support sql job submit in web
> base
> > on
> > >> >>>> flink 1.9.   we support submit to yarn on per job mode too.
> > >> >>>>    in this case, the job graph generated  on client side .  I
> think
> > >> >>> this
> > >> >>>> discuss Mainly to improve api programme.  but in my case , there
> is
> > >> no
> > >> >>>> jar to upload but only a sql string .
> > >> >>>>    do u had more suggestion to improve for sql mode or it is
> only a
> > >> >>>> switch for api programme?
> > >> >>>>
> > >> >>>>
> > >> >>>> best
> > >> >>>> bai jj
> > >> >>>>
> > >> >>>>
> > >> >>>> Yang Wang <[hidden email]> 于2019年12月18日周三 下午7:21写道:
> > >> >>>>
> > >> >>>>> I just want to revive this discussion.
> > >> >>>>>
> > >> >>>>> Recently, i am thinking about how to natively run flink per-job
> > >> >>> cluster on
> > >> >>>>> Kubernetes.
> > >> >>>>> The per-job mode on Kubernetes is very different from on Yarn.
> And
> > >> we
> > >> >>> will
> > >> >>>>> have
> > >> >>>>> the same deployment requirements to the client and entry point.
> > >> >>>>>
> > >> >>>>> 1. Flink client not always need a local jar to start a Flink
> > per-job
> > >> >>>>> cluster. We could
> > >> >>>>> support multiple schemas. For example, file:///path/of/my.jar
> > means
> > >> a
> > >> >>> jar
> > >> >>>>> located
> > >> >>>>> at client side, hdfs://myhdfs/user/myname/flink/my.jar means a
> jar
> > >> >>> located
> > >> >>>>> at
> > >> >>>>> remote hdfs, local:///path/in/image/my.jar means a jar located
> at
> > >> >>>>> jobmanager side.
> > >> >>>>>
> > >> >>>>> 2. Support running user program on master side. This also means
> > the
> > >> >>> entry
> > >> >>>>> point
> > >> >>>>> will generate the job graph on master side. We could use the
> > >> >>>>> ClasspathJobGraphRetriever
> > >> >>>>> or start a local Flink client to achieve this purpose.
> > >> >>>>>
> > >> >>>>>
> > >> >>>>> cc tison, Aljoscha & Kostas Do you think this is the right
> > >> direction we
> > >> >>>>> need to work?
> > >> >>>>>
> > >> >>>>> tison <[hidden email]> 于2019年12月12日周四 下午4:48写道:
> > >> >>>>>
> > >> >>>>>> A quick idea is that we separate the deployment from user
> program
> > >> >>> that
> > >> >>>>> it
> > >> >>>>>> has always been done
> > >> >>>>>> outside the program. On user program executed there is always a
> > >> >>>>>> ClusterClient that communicates with
> > >> >>>>>> an existing cluster, remote or local. It will be another thread
> > so
> > >> >>> just
> > >> >>>>> for
> > >> >>>>>> your information.
> > >> >>>>>>
> > >> >>>>>> Best,
> > >> >>>>>> tison.
> > >> >>>>>>
> > >> >>>>>>
> > >> >>>>>> tison <[hidden email]> 于2019年12月12日周四 下午4:40写道:
> > >> >>>>>>
> > >> >>>>>>> Hi Peter,
> > >> >>>>>>>
> > >> >>>>>>> Another concern I realized recently is that with current
> > Executors
> > >> >>>>>>> abstraction(FLIP-73)
> > >> >>>>>>> I'm afraid that user program is designed to ALWAYS run on the
> > >> >>> client
> > >> >>>>>> side.
> > >> >>>>>>> Specifically,
> > >> >>>>>>> we deploy the job in executor when env.execute called. This
> > >> >>>>> abstraction
> > >> >>>>>>> possibly prevents
> > >> >>>>>>> Flink runs user program on the cluster side.
> > >> >>>>>>>
> > >> >>>>>>> For your proposal, in this case we already compiled the
> program
> > >> and
> > >> >>>>> run
> > >> >>>>>> on
> > >> >>>>>>> the client side,
> > >> >>>>>>> even we deploy a cluster and retrieve job graph from program
> > >> >>>>> metadata, it
> > >> >>>>>>> doesn't make
> > >> >>>>>>> many sense.
> > >> >>>>>>>
> > >> >>>>>>> cc Aljoscha & Kostas what do you think about this constraint?
> > >> >>>>>>>
> > >> >>>>>>> Best,
> > >> >>>>>>> tison.
> > >> >>>>>>>
> > >> >>>>>>>
> > >> >>>>>>> Peter Huang <[hidden email]> 于2019年12月10日周二
> > >> 下午12:45写道:
> > >> >>>>>>>
> > >> >>>>>>>> Hi Tison,
> > >> >>>>>>>>
> > >> >>>>>>>> Yes, you are right. I think I made the wrong argument in the
> > doc.
> > >> >>>>>>>> Basically, the packaging jar problem is only for platform
> > users.
> > >> >>> In
> > >> >>>>> our
> > >> >>>>>>>> internal deploy service,
> > >> >>>>>>>> we further optimized the deployment latency by letting users
> to
> > >> >>>>>> packaging
> > >> >>>>>>>> flink-runtime together with the uber jar, so that we don't
> need
> > >> to
> > >> >>>>>>>> consider
> > >> >>>>>>>> multiple flink version
> > >> >>>>>>>> support for now. In the session client mode, as Flink libs
> will
> > >> be
> > >> >>>>>> shipped
> > >> >>>>>>>> anyway as local resources of yarn. Users actually don't need
> to
> > >> >>>>> package
> > >> >>>>>>>> those libs into job jar.
> > >> >>>>>>>>
> > >> >>>>>>>>
> > >> >>>>>>>>
> > >> >>>>>>>> Best Regards
> > >> >>>>>>>> Peter Huang
> > >> >>>>>>>>
> > >> >>>>>>>> On Mon, Dec 9, 2019 at 8:35 PM tison <[hidden email]>
> > >> >>> wrote:
> > >> >>>>>>>>
> > >> >>>>>>>>>> 3. What do you mean about the package? Do users need to
> > >> >>> compile
> > >> >>>>>> their
> > >> >>>>>>>>> jars
> > >> >>>>>>>>> inlcuding flink-clients, flink-optimizer, flink-table codes?
> > >> >>>>>>>>>
> > >> >>>>>>>>> The answer should be no because they exist in system
> > classpath.
> > >> >>>>>>>>>
> > >> >>>>>>>>> Best,
> > >> >>>>>>>>> tison.
> > >> >>>>>>>>>
> > >> >>>>>>>>>
> > >> >>>>>>>>> Yang Wang <[hidden email]> 于2019年12月10日周二 下午12:18写道:
> > >> >>>>>>>>>
> > >> >>>>>>>>>> Hi Peter,
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> Thanks a lot for starting this discussion. I think this is
> a
> > >> >>> very
> > >> >>>>>>>> useful
> > >> >>>>>>>>>> feature.
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> Not only for Yarn, i am focused on flink on Kubernetes
> > >> >>>>> integration
> > >> >>>>>> and
> > >> >>>>>>>>> come
> > >> >>>>>>>>>> across the same
> > >> >>>>>>>>>> problem. I do not want the job graph generated on client
> > side.
> > >> >>>>>>>> Instead,
> > >> >>>>>>>>> the
> > >> >>>>>>>>>> user jars are built in
> > >> >>>>>>>>>> a user-defined image. When the job manager launched, we
> just
> > >> >>>>> need to
> > >> >>>>>>>>>> generate the job graph
> > >> >>>>>>>>>> based on local user jars.
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> I have some small suggestion about this.
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> 1. `ProgramJobGraphRetriever` is very similar to
> > >> >>>>>>>>>> `ClasspathJobGraphRetriever`, the differences
> > >> >>>>>>>>>> are the former needs `ProgramMetadata` and the latter needs
> > >> >>> some
> > >> >>>>>>>>> arguments.
> > >> >>>>>>>>>> Is it possible to
> > >> >>>>>>>>>> have an unified `JobGraphRetriever` to support both?
> > >> >>>>>>>>>> 2. Is it possible to not use a local user jar to start a
> > >> >>> per-job
> > >> >>>>>>>> cluster?
> > >> >>>>>>>>>> In your case, the user jars has
> > >> >>>>>>>>>> existed on hdfs already and we do need to download the jars
> > to
> > >> >>>>>>>> deployer
> > >> >>>>>>>>>> service. Currently, we
> > >> >>>>>>>>>> always need a local user jar to start a flink cluster. It
> is
> > >> >>> be
> > >> >>>>>> great
> > >> >>>>>>>> if
> > >> >>>>>>>>> we
> > >> >>>>>>>>>> could support remote user jars.
> > >> >>>>>>>>>>>> In the implementation, we assume users package
> > >> >>> flink-clients,
> > >> >>>>>>>>>> flink-optimizer, flink-table together within the job jar.
> > >> >>>>> Otherwise,
> > >> >>>>>>>> the
> > >> >>>>>>>>>> job graph generation within JobClusterEntryPoint will fail.
> > >> >>>>>>>>>> 3. What do you mean about the package? Do users need to
> > >> >>> compile
> > >> >>>>>> their
> > >> >>>>>>>>> jars
> > >> >>>>>>>>>> inlcuding flink-clients, flink-optimizer, flink-table
> codes?
> > >> >>>>>>>>>>
> > >> >>>>>>>>>>
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> Best,
> > >> >>>>>>>>>> Yang
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> Peter Huang <[hidden email]> 于2019年12月10日周二
> > >> >>>>> 上午2:37写道:
> > >> >>>>>>>>>>
> > >> >>>>>>>>>>> Dear All,
> > >> >>>>>>>>>>>
> > >> >>>>>>>>>>> Recently, the Flink community starts to improve the yarn
> > >> >>>>> cluster
> > >> >>>>>>>>>> descriptor
> > >> >>>>>>>>>>> to make job jar and config files configurable from CLI. It
> > >> >>>>>> improves
> > >> >>>>>>>> the
> > >> >>>>>>>>>>> flexibility of  Flink deployment Yarn Per Job Mode. For
> > >> >>>>> platform
> > >> >>>>>>>> users
> > >> >>>>>>>>>> who
> > >> >>>>>>>>>>> manage tens of hundreds of streaming pipelines for the
> whole
> > >> >>>>> org
> > >> >>>>>> or
> > >> >>>>>>>>>>> company, we found the job graph generation in client-side
> is
> > >> >>>>>> another
> > >> >>>>>>>>>>> pinpoint. Thus, we want to propose a configurable feature
> > >> >>> for
> > >> >>>>>>>>>>> FlinkYarnSessionCli. The feature can allow users to choose
> > >> >>> the
> > >> >>>>> job
> > >> >>>>>>>>> graph
> > >> >>>>>>>>>>> generation in Flink ClusterEntryPoint so that the job jar
> > >> >>>>> doesn't
> > >> >>>>>>>> need
> > >> >>>>>>>>> to
> > >> >>>>>>>>>>> be locally for the job graph generation. The proposal is
> > >> >>>>> organized
> > >> >>>>>>>> as a
> > >> >>>>>>>>>>> FLIP
> > >> >>>>>>>>>>>
> > >> >>>>>>>>>>>
> > >> >>>>>>>>>>
> > >> >>>>>>>>>
> > >> >>>>>>>>
> > >> >>>>>>
> > >> >>>>>
> > >> >>>
> > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-85+Delayed+JobGraph+Generation
> > >> >>>>>>>>>>> .
> > >> >>>>>>>>>>>
> > >> >>>>>>>>>>> Any questions and suggestions are welcomed. Thank you in
> > >> >>>>> advance.
> > >> >>>>>>>>>>>
> > >> >>>>>>>>>>>
> > >> >>>>>>>>>>> Best Regards
> > >> >>>>>>>>>>> Peter Huang
> > >> >>>>>>>>>>>
> > >> >>>>>>>>>>
> > >> >>>>>>>>>
> > >> >>>>>>>>
> > >> >>>>>>>
> > >> >>>>>>
> > >> >>>>>
> > >> >>>>
> > >> >>>
> > >> >>
> > >>
> > >>
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-85: Delayed Job Graph Generation

tison
Hi Peter,

Could you try out this link? https://the-asf.slack.com/messages/CNA3ADZPH

Best,
tison.


Peter Huang <[hidden email]> 于2020年1月9日周四 上午1:22写道:

> Hi Tison,
>
> I can't join the group with shared link. Would you please add me into the
> group? My slack account is huangzhenqiu0825.
> Thank you in advance.
>
>
> Best Regards
> Peter Huang
>
> On Wed, Jan 8, 2020 at 12:02 AM tison <[hidden email]> wrote:
>
> > Hi Peter,
> >
> > As described above, this effort should get attention from people
> developing
> > FLIP-73 a.k.a. Executor abstractions. I recommend you to join the public
> > slack channel[1] for Flink Client API Enhancement and you can try to
> share
> > you detailed thoughts there. It possibly gets more concrete attentions.
> >
> > Best,
> > tison.
> >
> > [1]
> >
> >
> https://slack.com/share/IS21SJ75H/Rk8HhUly9FuEHb7oGwBZ33uL/enQtODg2MDYwNjE5MTg3LTA2MjIzNDc1M2ZjZDVlMjdlZjk1M2RkYmJhNjAwMTk2ZDZkODQ4NmY5YmI4OGRhNWJkYTViMTM1NzlmMzc4OWM
> >
> >
> > Peter Huang <[hidden email]> 于2020年1月7日周二 上午5:09写道:
> >
> > > Dear All,
> > >
> > > Happy new year! According to existing feedback from the community, we
> > > revised the doc with the consideration of session cluster support, and
> > > concrete interface changes needed and execution plan. Please take one
> > more
> > > round of review at your most convenient time.
> > >
> > >
> > >
> >
> https://docs.google.com/document/d/1aAwVjdZByA-0CHbgv16Me-vjaaDMCfhX7TzVVTuifYM/edit#
> > >
> > >
> > > Best Regards
> > > Peter Huang
> > >
> > >
> > >
> > >
> > >
> > > On Thu, Jan 2, 2020 at 11:29 AM Peter Huang <
> [hidden email]>
> > > wrote:
> > >
> > > > Hi Dian,
> > > > Thanks for giving us valuable feedbacks.
> > > >
> > > > 1) It's better to have a whole design for this feature
> > > > For the suggestion of enabling the cluster mode also session
> cluster, I
> > > > think Flink already supported it. WebSubmissionExtension already
> allows
> > > > users to start a job with the specified jar by using web UI.
> > > > But we need to enable the feature from CLI for both local jar, remote
> > > jar.
> > > > I will align with Yang Wang first about the details and update the
> > design
> > > > doc.
> > > >
> > > > 2) It's better to consider the convenience for users, such as
> debugging
> > > >
> > > > I am wondering whether we can store the exception in jobgragh
> > > > generation in application master. As no streaming graph can be
> > scheduled
> > > in
> > > > this case, there will be no more TM will be requested from FlinkRM.
> > > > If the AM is still running, users can still query it from CLI. As it
> > > > requires more change, we can get some feedback from <
> > [hidden email]
> > > >
> > > > and @[hidden email] <[hidden email]>.
> > > >
> > > > 3) It's better to consider the impact to the stability of the cluster
> > > >
> > > > I agree with Yang Wang's opinion.
> > > >
> > > >
> > > >
> > > > Best Regards
> > > > Peter Huang
> > > >
> > > >
> > > > On Sun, Dec 29, 2019 at 9:44 PM Dian Fu <[hidden email]>
> wrote:
> > > >
> > > >> Hi all,
> > > >>
> > > >> Sorry to jump into this discussion. Thanks everyone for the
> > discussion.
> > > >> I'm very interested in this topic although I'm not an expert in this
> > > part.
> > > >> So I'm glad to share my thoughts as following:
> > > >>
> > > >> 1) It's better to have a whole design for this feature
> > > >> As we know, there are two deployment modes: per-job mode and session
> > > >> mode. I'm wondering which mode really needs this feature. As the
> > design
> > > doc
> > > >> mentioned, per-job mode is more used for streaming jobs and session
> > > mode is
> > > >> usually used for batch jobs(Of course, the job types and the
> > deployment
> > > >> modes are orthogonal). Usually streaming job is only needed to be
> > > submitted
> > > >> once and it will run for days or weeks, while batch jobs will be
> > > submitted
> > > >> more frequently compared with streaming jobs. This means that maybe
> > > session
> > > >> mode also needs this feature. However, if we support this feature in
> > > >> session mode, the application master will become the new centralized
> > > >> service(which should be solved). So in this case, it's better to
> have
> > a
> > > >> complete design for both per-job mode and session mode. Furthermore,
> > > even
> > > >> if we can do it phase by phase, we need to have a whole picture of
> how
> > > it
> > > >> works in both per-job mode and session mode.
> > > >>
> > > >> 2) It's better to consider the convenience for users, such as
> > debugging
> > > >> After we finish this feature, the job graph will be compiled in the
> > > >> application master, which means that users cannot easily get the
> > > exception
> > > >> message synchorousely in the job client if there are problems during
> > the
> > > >> job graph compiling (especially for platform users), such as the
> > > resource
> > > >> path is incorrect, the user program itself has some problems, etc.
> > What
> > > I'm
> > > >> thinking is that maybe we should throw the exceptions as early as
> > > possible
> > > >> (during job submission stage).
> > > >>
> > > >> 3) It's better to consider the impact to the stability of the
> cluster
> > > >> If we perform the compiling in the application master, we should
> > > consider
> > > >> the impact of the compiling errors. Although YARN could resume the
> > > >> application master in case of failures, but in some case the
> compiling
> > > >> failure may be a waste of cluster resource and may impact the
> > stability
> > > the
> > > >> cluster and the other jobs in the cluster, such as the resource path
> > is
> > > >> incorrect, the user program itself has some problems(in this case,
> job
> > > >> failover cannot solve this kind of problems) etc. In the current
> > > >> implemention, the compiling errors are handled in the client side
> and
> > > there
> > > >> is no impact to the cluster at all.
> > > >>
> > > >> Regarding to 1), it's clearly pointed in the design doc that only
> > > per-job
> > > >> mode will be supported. However, I think it's better to also
> consider
> > > the
> > > >> session mode in the design doc.
> > > >> Regarding to 2) and 3), I have not seen related sections in the
> design
> > > >> doc. It will be good if we can cover them in the design doc.
> > > >>
> > > >> Feel free to correct me If there is anything I misunderstand.
> > > >>
> > > >> Regards,
> > > >> Dian
> > > >>
> > > >>
> > > >> > 在 2019年12月27日,上午3:13,Peter Huang <[hidden email]> 写道:
> > > >> >
> > > >> > Hi Yang,
> > > >> >
> > > >> > I can't agree more. The effort definitely needs to align with the
> > > final
> > > >> > goal of FLIP-73.
> > > >> > I am thinking about whether we can achieve the goal with two
> phases.
> > > >> >
> > > >> > 1) Phase I
> > > >> > As the CLiFrontend will not be depreciated soon. We can still use
> > the
> > > >> > deployMode flag there,
> > > >> > pass the program info through Flink configuration,  use the
> > > >> > ClassPathJobGraphRetriever
> > > >> > to generate the job graph in ClusterEntrypoints of yarn and
> > > Kubernetes.
> > > >> >
> > > >> > 2) Phase II
> > > >> > In  AbstractJobClusterExecutor, the job graph is generated in the
> > > >> execute
> > > >> > function. We can still
> > > >> > use the deployMode in it. With deployMode = cluster, the execute
> > > >> function
> > > >> > only starts the cluster.
> > > >> >
> > > >> > When {Yarn/Kuberneates}PerJobClusterEntrypoint starts, It will
> start
> > > the
> > > >> > dispatch first, then we can use
> > > >> > a ClusterEnvironment similar to ContextEnvironment to submit the
> job
> > > >> with
> > > >> > jobName the local
> > > >> > dispatcher. For the details, we need more investigation. Let's
> wait
> > > >> > for @Aljoscha
> > > >> > Krettek <[hidden email]> @Till Rohrmann <
> [hidden email]
> > >'s
> > > >> > feedback after the holiday season.
> > > >> >
> > > >> > Thank you in advance. Merry Chrismas and Happy New Year!!!
> > > >> >
> > > >> >
> > > >> > Best Regards
> > > >> > Peter Huang
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> > On Wed, Dec 25, 2019 at 1:08 AM Yang Wang <[hidden email]>
> > > >> wrote:
> > > >> >
> > > >> >> Hi Peter,
> > > >> >>
> > > >> >> I think we need to reconsider tison's suggestion seriously. After
> > > >> FLIP-73,
> > > >> >> the deployJobCluster has
> > > >> >> beenmoved into `JobClusterExecutor#execute`. It should not be
> > > perceived
> > > >> >> for `CliFrontend`. That
> > > >> >> means the user program will *ALWAYS* be executed on client side.
> > This
> > > >> is
> > > >> >> the by design behavior.
> > > >> >> So, we could not just add `if(client mode) .. else if(cluster
> mode)
> > > >> ...`
> > > >> >> codes in `CliFrontend` to bypass
> > > >> >> the executor. We need to find a clean way to decouple executing
> > user
> > > >> >> program and deploying per-job
> > > >> >> cluster. Based on this, we could support to execute user program
> on
> > > >> client
> > > >> >> or master side.
> > > >> >>
> > > >> >> Maybe Aljoscha and Jeff could give some good suggestions.
> > > >> >>
> > > >> >>
> > > >> >>
> > > >> >> Best,
> > > >> >> Yang
> > > >> >>
> > > >> >> Peter Huang <[hidden email]> 于2019年12月25日周三
> 上午4:03写道:
> > > >> >>
> > > >> >>> Hi Jingjing,
> > > >> >>>
> > > >> >>> The improvement proposed is a deployment option for CLI. For SQL
> > > based
> > > >> >>> Flink application, It is more convenient to use the existing
> model
> > > in
> > > >> >>> SqlClient in which
> > > >> >>> the job graph is generated within SqlClient. After adding the
> > > delayed
> > > >> job
> > > >> >>> graph generation, I think there is no change is needed for your
> > > side.
> > > >> >>>
> > > >> >>>
> > > >> >>> Best Regards
> > > >> >>> Peter Huang
> > > >> >>>
> > > >> >>>
> > > >> >>> On Wed, Dec 18, 2019 at 6:01 AM jingjing bai <
> > > >> [hidden email]>
> > > >> >>> wrote:
> > > >> >>>
> > > >> >>>> hi peter:
> > > >> >>>>    we had extension SqlClent to support sql job submit in web
> > base
> > > on
> > > >> >>>> flink 1.9.   we support submit to yarn on per job mode too.
> > > >> >>>>    in this case, the job graph generated  on client side .  I
> > think
> > > >> >>> this
> > > >> >>>> discuss Mainly to improve api programme.  but in my case ,
> there
> > is
> > > >> no
> > > >> >>>> jar to upload but only a sql string .
> > > >> >>>>    do u had more suggestion to improve for sql mode or it is
> > only a
> > > >> >>>> switch for api programme?
> > > >> >>>>
> > > >> >>>>
> > > >> >>>> best
> > > >> >>>> bai jj
> > > >> >>>>
> > > >> >>>>
> > > >> >>>> Yang Wang <[hidden email]> 于2019年12月18日周三 下午7:21写道:
> > > >> >>>>
> > > >> >>>>> I just want to revive this discussion.
> > > >> >>>>>
> > > >> >>>>> Recently, i am thinking about how to natively run flink
> per-job
> > > >> >>> cluster on
> > > >> >>>>> Kubernetes.
> > > >> >>>>> The per-job mode on Kubernetes is very different from on Yarn.
> > And
> > > >> we
> > > >> >>> will
> > > >> >>>>> have
> > > >> >>>>> the same deployment requirements to the client and entry
> point.
> > > >> >>>>>
> > > >> >>>>> 1. Flink client not always need a local jar to start a Flink
> > > per-job
> > > >> >>>>> cluster. We could
> > > >> >>>>> support multiple schemas. For example, file:///path/of/my.jar
> > > means
> > > >> a
> > > >> >>> jar
> > > >> >>>>> located
> > > >> >>>>> at client side, hdfs://myhdfs/user/myname/flink/my.jar means a
> > jar
> > > >> >>> located
> > > >> >>>>> at
> > > >> >>>>> remote hdfs, local:///path/in/image/my.jar means a jar located
> > at
> > > >> >>>>> jobmanager side.
> > > >> >>>>>
> > > >> >>>>> 2. Support running user program on master side. This also
> means
> > > the
> > > >> >>> entry
> > > >> >>>>> point
> > > >> >>>>> will generate the job graph on master side. We could use the
> > > >> >>>>> ClasspathJobGraphRetriever
> > > >> >>>>> or start a local Flink client to achieve this purpose.
> > > >> >>>>>
> > > >> >>>>>
> > > >> >>>>> cc tison, Aljoscha & Kostas Do you think this is the right
> > > >> direction we
> > > >> >>>>> need to work?
> > > >> >>>>>
> > > >> >>>>> tison <[hidden email]> 于2019年12月12日周四 下午4:48写道:
> > > >> >>>>>
> > > >> >>>>>> A quick idea is that we separate the deployment from user
> > program
> > > >> >>> that
> > > >> >>>>> it
> > > >> >>>>>> has always been done
> > > >> >>>>>> outside the program. On user program executed there is
> always a
> > > >> >>>>>> ClusterClient that communicates with
> > > >> >>>>>> an existing cluster, remote or local. It will be another
> thread
> > > so
> > > >> >>> just
> > > >> >>>>> for
> > > >> >>>>>> your information.
> > > >> >>>>>>
> > > >> >>>>>> Best,
> > > >> >>>>>> tison.
> > > >> >>>>>>
> > > >> >>>>>>
> > > >> >>>>>> tison <[hidden email]> 于2019年12月12日周四 下午4:40写道:
> > > >> >>>>>>
> > > >> >>>>>>> Hi Peter,
> > > >> >>>>>>>
> > > >> >>>>>>> Another concern I realized recently is that with current
> > > Executors
> > > >> >>>>>>> abstraction(FLIP-73)
> > > >> >>>>>>> I'm afraid that user program is designed to ALWAYS run on
> the
> > > >> >>> client
> > > >> >>>>>> side.
> > > >> >>>>>>> Specifically,
> > > >> >>>>>>> we deploy the job in executor when env.execute called. This
> > > >> >>>>> abstraction
> > > >> >>>>>>> possibly prevents
> > > >> >>>>>>> Flink runs user program on the cluster side.
> > > >> >>>>>>>
> > > >> >>>>>>> For your proposal, in this case we already compiled the
> > program
> > > >> and
> > > >> >>>>> run
> > > >> >>>>>> on
> > > >> >>>>>>> the client side,
> > > >> >>>>>>> even we deploy a cluster and retrieve job graph from program
> > > >> >>>>> metadata, it
> > > >> >>>>>>> doesn't make
> > > >> >>>>>>> many sense.
> > > >> >>>>>>>
> > > >> >>>>>>> cc Aljoscha & Kostas what do you think about this
> constraint?
> > > >> >>>>>>>
> > > >> >>>>>>> Best,
> > > >> >>>>>>> tison.
> > > >> >>>>>>>
> > > >> >>>>>>>
> > > >> >>>>>>> Peter Huang <[hidden email]> 于2019年12月10日周二
> > > >> 下午12:45写道:
> > > >> >>>>>>>
> > > >> >>>>>>>> Hi Tison,
> > > >> >>>>>>>>
> > > >> >>>>>>>> Yes, you are right. I think I made the wrong argument in
> the
> > > doc.
> > > >> >>>>>>>> Basically, the packaging jar problem is only for platform
> > > users.
> > > >> >>> In
> > > >> >>>>> our
> > > >> >>>>>>>> internal deploy service,
> > > >> >>>>>>>> we further optimized the deployment latency by letting
> users
> > to
> > > >> >>>>>> packaging
> > > >> >>>>>>>> flink-runtime together with the uber jar, so that we don't
> > need
> > > >> to
> > > >> >>>>>>>> consider
> > > >> >>>>>>>> multiple flink version
> > > >> >>>>>>>> support for now. In the session client mode, as Flink libs
> > will
> > > >> be
> > > >> >>>>>> shipped
> > > >> >>>>>>>> anyway as local resources of yarn. Users actually don't
> need
> > to
> > > >> >>>>> package
> > > >> >>>>>>>> those libs into job jar.
> > > >> >>>>>>>>
> > > >> >>>>>>>>
> > > >> >>>>>>>>
> > > >> >>>>>>>> Best Regards
> > > >> >>>>>>>> Peter Huang
> > > >> >>>>>>>>
> > > >> >>>>>>>> On Mon, Dec 9, 2019 at 8:35 PM tison <[hidden email]
> >
> > > >> >>> wrote:
> > > >> >>>>>>>>
> > > >> >>>>>>>>>> 3. What do you mean about the package? Do users need to
> > > >> >>> compile
> > > >> >>>>>> their
> > > >> >>>>>>>>> jars
> > > >> >>>>>>>>> inlcuding flink-clients, flink-optimizer, flink-table
> codes?
> > > >> >>>>>>>>>
> > > >> >>>>>>>>> The answer should be no because they exist in system
> > > classpath.
> > > >> >>>>>>>>>
> > > >> >>>>>>>>> Best,
> > > >> >>>>>>>>> tison.
> > > >> >>>>>>>>>
> > > >> >>>>>>>>>
> > > >> >>>>>>>>> Yang Wang <[hidden email]> 于2019年12月10日周二
> 下午12:18写道:
> > > >> >>>>>>>>>
> > > >> >>>>>>>>>> Hi Peter,
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> Thanks a lot for starting this discussion. I think this
> is
> > a
> > > >> >>> very
> > > >> >>>>>>>> useful
> > > >> >>>>>>>>>> feature.
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> Not only for Yarn, i am focused on flink on Kubernetes
> > > >> >>>>> integration
> > > >> >>>>>> and
> > > >> >>>>>>>>> come
> > > >> >>>>>>>>>> across the same
> > > >> >>>>>>>>>> problem. I do not want the job graph generated on client
> > > side.
> > > >> >>>>>>>> Instead,
> > > >> >>>>>>>>> the
> > > >> >>>>>>>>>> user jars are built in
> > > >> >>>>>>>>>> a user-defined image. When the job manager launched, we
> > just
> > > >> >>>>> need to
> > > >> >>>>>>>>>> generate the job graph
> > > >> >>>>>>>>>> based on local user jars.
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> I have some small suggestion about this.
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> 1. `ProgramJobGraphRetriever` is very similar to
> > > >> >>>>>>>>>> `ClasspathJobGraphRetriever`, the differences
> > > >> >>>>>>>>>> are the former needs `ProgramMetadata` and the latter
> needs
> > > >> >>> some
> > > >> >>>>>>>>> arguments.
> > > >> >>>>>>>>>> Is it possible to
> > > >> >>>>>>>>>> have an unified `JobGraphRetriever` to support both?
> > > >> >>>>>>>>>> 2. Is it possible to not use a local user jar to start a
> > > >> >>> per-job
> > > >> >>>>>>>> cluster?
> > > >> >>>>>>>>>> In your case, the user jars has
> > > >> >>>>>>>>>> existed on hdfs already and we do need to download the
> jars
> > > to
> > > >> >>>>>>>> deployer
> > > >> >>>>>>>>>> service. Currently, we
> > > >> >>>>>>>>>> always need a local user jar to start a flink cluster. It
> > is
> > > >> >>> be
> > > >> >>>>>> great
> > > >> >>>>>>>> if
> > > >> >>>>>>>>> we
> > > >> >>>>>>>>>> could support remote user jars.
> > > >> >>>>>>>>>>>> In the implementation, we assume users package
> > > >> >>> flink-clients,
> > > >> >>>>>>>>>> flink-optimizer, flink-table together within the job jar.
> > > >> >>>>> Otherwise,
> > > >> >>>>>>>> the
> > > >> >>>>>>>>>> job graph generation within JobClusterEntryPoint will
> fail.
> > > >> >>>>>>>>>> 3. What do you mean about the package? Do users need to
> > > >> >>> compile
> > > >> >>>>>> their
> > > >> >>>>>>>>> jars
> > > >> >>>>>>>>>> inlcuding flink-clients, flink-optimizer, flink-table
> > codes?
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> Best,
> > > >> >>>>>>>>>> Yang
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>> Peter Huang <[hidden email]> 于2019年12月10日周二
> > > >> >>>>> 上午2:37写道:
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>>> Dear All,
> > > >> >>>>>>>>>>>
> > > >> >>>>>>>>>>> Recently, the Flink community starts to improve the yarn
> > > >> >>>>> cluster
> > > >> >>>>>>>>>> descriptor
> > > >> >>>>>>>>>>> to make job jar and config files configurable from CLI.
> It
> > > >> >>>>>> improves
> > > >> >>>>>>>> the
> > > >> >>>>>>>>>>> flexibility of  Flink deployment Yarn Per Job Mode. For
> > > >> >>>>> platform
> > > >> >>>>>>>> users
> > > >> >>>>>>>>>> who
> > > >> >>>>>>>>>>> manage tens of hundreds of streaming pipelines for the
> > whole
> > > >> >>>>> org
> > > >> >>>>>> or
> > > >> >>>>>>>>>>> company, we found the job graph generation in
> client-side
> > is
> > > >> >>>>>> another
> > > >> >>>>>>>>>>> pinpoint. Thus, we want to propose a configurable
> feature
> > > >> >>> for
> > > >> >>>>>>>>>>> FlinkYarnSessionCli. The feature can allow users to
> choose
> > > >> >>> the
> > > >> >>>>> job
> > > >> >>>>>>>>> graph
> > > >> >>>>>>>>>>> generation in Flink ClusterEntryPoint so that the job
> jar
> > > >> >>>>> doesn't
> > > >> >>>>>>>> need
> > > >> >>>>>>>>> to
> > > >> >>>>>>>>>>> be locally for the job graph generation. The proposal is
> > > >> >>>>> organized
> > > >> >>>>>>>> as a
> > > >> >>>>>>>>>>> FLIP
> > > >> >>>>>>>>>>>
> > > >> >>>>>>>>>>>
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>
> > > >> >>>>>>>>
> > > >> >>>>>>
> > > >> >>>>>
> > > >> >>>
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-85+Delayed+JobGraph+Generation
> > > >> >>>>>>>>>>> .
> > > >> >>>>>>>>>>>
> > > >> >>>>>>>>>>> Any questions and suggestions are welcomed. Thank you in
> > > >> >>>>> advance.
> > > >> >>>>>>>>>>>
> > > >> >>>>>>>>>>>
> > > >> >>>>>>>>>>> Best Regards
> > > >> >>>>>>>>>>> Peter Huang
> > > >> >>>>>>>>>>>
> > > >> >>>>>>>>>>
> > > >> >>>>>>>>>
> > > >> >>>>>>>>
> > > >> >>>>>>>
> > > >> >>>>>>
> > > >> >>>>>
> > > >> >>>>
> > > >> >>>
> > > >> >>
> > > >>
> > > >>
> > >
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-85: Delayed Job Graph Generation

Peter Huang
Hi Tison,

I am not the committer of Flink yet. I think I can't join it also.


Best Regards
Peter Huang

On Wed, Jan 8, 2020 at 9:39 AM tison <[hidden email]> wrote:

> Hi Peter,
>
> Could you try out this link? https://the-asf.slack.com/messages/CNA3ADZPH
>
> Best,
> tison.
>
>
> Peter Huang <[hidden email]> 于2020年1月9日周四 上午1:22写道:
>
> > Hi Tison,
> >
> > I can't join the group with shared link. Would you please add me into the
> > group? My slack account is huangzhenqiu0825.
> > Thank you in advance.
> >
> >
> > Best Regards
> > Peter Huang
> >
> > On Wed, Jan 8, 2020 at 12:02 AM tison <[hidden email]> wrote:
> >
> > > Hi Peter,
> > >
> > > As described above, this effort should get attention from people
> > developing
> > > FLIP-73 a.k.a. Executor abstractions. I recommend you to join the
> public
> > > slack channel[1] for Flink Client API Enhancement and you can try to
> > share
> > > you detailed thoughts there. It possibly gets more concrete attentions.
> > >
> > > Best,
> > > tison.
> > >
> > > [1]
> > >
> > >
> >
> https://slack.com/share/IS21SJ75H/Rk8HhUly9FuEHb7oGwBZ33uL/enQtODg2MDYwNjE5MTg3LTA2MjIzNDc1M2ZjZDVlMjdlZjk1M2RkYmJhNjAwMTk2ZDZkODQ4NmY5YmI4OGRhNWJkYTViMTM1NzlmMzc4OWM
> > >
> > >
> > > Peter Huang <[hidden email]> 于2020年1月7日周二 上午5:09写道:
> > >
> > > > Dear All,
> > > >
> > > > Happy new year! According to existing feedback from the community, we
> > > > revised the doc with the consideration of session cluster support,
> and
> > > > concrete interface changes needed and execution plan. Please take one
> > > more
> > > > round of review at your most convenient time.
> > > >
> > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1aAwVjdZByA-0CHbgv16Me-vjaaDMCfhX7TzVVTuifYM/edit#
> > > >
> > > >
> > > > Best Regards
> > > > Peter Huang
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Thu, Jan 2, 2020 at 11:29 AM Peter Huang <
> > [hidden email]>
> > > > wrote:
> > > >
> > > > > Hi Dian,
> > > > > Thanks for giving us valuable feedbacks.
> > > > >
> > > > > 1) It's better to have a whole design for this feature
> > > > > For the suggestion of enabling the cluster mode also session
> > cluster, I
> > > > > think Flink already supported it. WebSubmissionExtension already
> > allows
> > > > > users to start a job with the specified jar by using web UI.
> > > > > But we need to enable the feature from CLI for both local jar,
> remote
> > > > jar.
> > > > > I will align with Yang Wang first about the details and update the
> > > design
> > > > > doc.
> > > > >
> > > > > 2) It's better to consider the convenience for users, such as
> > debugging
> > > > >
> > > > > I am wondering whether we can store the exception in jobgragh
> > > > > generation in application master. As no streaming graph can be
> > > scheduled
> > > > in
> > > > > this case, there will be no more TM will be requested from FlinkRM.
> > > > > If the AM is still running, users can still query it from CLI. As
> it
> > > > > requires more change, we can get some feedback from <
> > > [hidden email]
> > > > >
> > > > > and @[hidden email] <[hidden email]>.
> > > > >
> > > > > 3) It's better to consider the impact to the stability of the
> cluster
> > > > >
> > > > > I agree with Yang Wang's opinion.
> > > > >
> > > > >
> > > > >
> > > > > Best Regards
> > > > > Peter Huang
> > > > >
> > > > >
> > > > > On Sun, Dec 29, 2019 at 9:44 PM Dian Fu <[hidden email]>
> > wrote:
> > > > >
> > > > >> Hi all,
> > > > >>
> > > > >> Sorry to jump into this discussion. Thanks everyone for the
> > > discussion.
> > > > >> I'm very interested in this topic although I'm not an expert in
> this
> > > > part.
> > > > >> So I'm glad to share my thoughts as following:
> > > > >>
> > > > >> 1) It's better to have a whole design for this feature
> > > > >> As we know, there are two deployment modes: per-job mode and
> session
> > > > >> mode. I'm wondering which mode really needs this feature. As the
> > > design
> > > > doc
> > > > >> mentioned, per-job mode is more used for streaming jobs and
> session
> > > > mode is
> > > > >> usually used for batch jobs(Of course, the job types and the
> > > deployment
> > > > >> modes are orthogonal). Usually streaming job is only needed to be
> > > > submitted
> > > > >> once and it will run for days or weeks, while batch jobs will be
> > > > submitted
> > > > >> more frequently compared with streaming jobs. This means that
> maybe
> > > > session
> > > > >> mode also needs this feature. However, if we support this feature
> in
> > > > >> session mode, the application master will become the new
> centralized
> > > > >> service(which should be solved). So in this case, it's better to
> > have
> > > a
> > > > >> complete design for both per-job mode and session mode.
> Furthermore,
> > > > even
> > > > >> if we can do it phase by phase, we need to have a whole picture of
> > how
> > > > it
> > > > >> works in both per-job mode and session mode.
> > > > >>
> > > > >> 2) It's better to consider the convenience for users, such as
> > > debugging
> > > > >> After we finish this feature, the job graph will be compiled in
> the
> > > > >> application master, which means that users cannot easily get the
> > > > exception
> > > > >> message synchorousely in the job client if there are problems
> during
> > > the
> > > > >> job graph compiling (especially for platform users), such as the
> > > > resource
> > > > >> path is incorrect, the user program itself has some problems, etc.
> > > What
> > > > I'm
> > > > >> thinking is that maybe we should throw the exceptions as early as
> > > > possible
> > > > >> (during job submission stage).
> > > > >>
> > > > >> 3) It's better to consider the impact to the stability of the
> > cluster
> > > > >> If we perform the compiling in the application master, we should
> > > > consider
> > > > >> the impact of the compiling errors. Although YARN could resume the
> > > > >> application master in case of failures, but in some case the
> > compiling
> > > > >> failure may be a waste of cluster resource and may impact the
> > > stability
> > > > the
> > > > >> cluster and the other jobs in the cluster, such as the resource
> path
> > > is
> > > > >> incorrect, the user program itself has some problems(in this case,
> > job
> > > > >> failover cannot solve this kind of problems) etc. In the current
> > > > >> implemention, the compiling errors are handled in the client side
> > and
> > > > there
> > > > >> is no impact to the cluster at all.
> > > > >>
> > > > >> Regarding to 1), it's clearly pointed in the design doc that only
> > > > per-job
> > > > >> mode will be supported. However, I think it's better to also
> > consider
> > > > the
> > > > >> session mode in the design doc.
> > > > >> Regarding to 2) and 3), I have not seen related sections in the
> > design
> > > > >> doc. It will be good if we can cover them in the design doc.
> > > > >>
> > > > >> Feel free to correct me If there is anything I misunderstand.
> > > > >>
> > > > >> Regards,
> > > > >> Dian
> > > > >>
> > > > >>
> > > > >> > 在 2019年12月27日,上午3:13,Peter Huang <[hidden email]>
> 写道:
> > > > >> >
> > > > >> > Hi Yang,
> > > > >> >
> > > > >> > I can't agree more. The effort definitely needs to align with
> the
> > > > final
> > > > >> > goal of FLIP-73.
> > > > >> > I am thinking about whether we can achieve the goal with two
> > phases.
> > > > >> >
> > > > >> > 1) Phase I
> > > > >> > As the CLiFrontend will not be depreciated soon. We can still
> use
> > > the
> > > > >> > deployMode flag there,
> > > > >> > pass the program info through Flink configuration,  use the
> > > > >> > ClassPathJobGraphRetriever
> > > > >> > to generate the job graph in ClusterEntrypoints of yarn and
> > > > Kubernetes.
> > > > >> >
> > > > >> > 2) Phase II
> > > > >> > In  AbstractJobClusterExecutor, the job graph is generated in
> the
> > > > >> execute
> > > > >> > function. We can still
> > > > >> > use the deployMode in it. With deployMode = cluster, the execute
> > > > >> function
> > > > >> > only starts the cluster.
> > > > >> >
> > > > >> > When {Yarn/Kuberneates}PerJobClusterEntrypoint starts, It will
> > start
> > > > the
> > > > >> > dispatch first, then we can use
> > > > >> > a ClusterEnvironment similar to ContextEnvironment to submit the
> > job
> > > > >> with
> > > > >> > jobName the local
> > > > >> > dispatcher. For the details, we need more investigation. Let's
> > wait
> > > > >> > for @Aljoscha
> > > > >> > Krettek <[hidden email]> @Till Rohrmann <
> > [hidden email]
> > > >'s
> > > > >> > feedback after the holiday season.
> > > > >> >
> > > > >> > Thank you in advance. Merry Chrismas and Happy New Year!!!
> > > > >> >
> > > > >> >
> > > > >> > Best Regards
> > > > >> > Peter Huang
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> > On Wed, Dec 25, 2019 at 1:08 AM Yang Wang <
> [hidden email]>
> > > > >> wrote:
> > > > >> >
> > > > >> >> Hi Peter,
> > > > >> >>
> > > > >> >> I think we need to reconsider tison's suggestion seriously.
> After
> > > > >> FLIP-73,
> > > > >> >> the deployJobCluster has
> > > > >> >> beenmoved into `JobClusterExecutor#execute`. It should not be
> > > > perceived
> > > > >> >> for `CliFrontend`. That
> > > > >> >> means the user program will *ALWAYS* be executed on client
> side.
> > > This
> > > > >> is
> > > > >> >> the by design behavior.
> > > > >> >> So, we could not just add `if(client mode) .. else if(cluster
> > mode)
> > > > >> ...`
> > > > >> >> codes in `CliFrontend` to bypass
> > > > >> >> the executor. We need to find a clean way to decouple executing
> > > user
> > > > >> >> program and deploying per-job
> > > > >> >> cluster. Based on this, we could support to execute user
> program
> > on
> > > > >> client
> > > > >> >> or master side.
> > > > >> >>
> > > > >> >> Maybe Aljoscha and Jeff could give some good suggestions.
> > > > >> >>
> > > > >> >>
> > > > >> >>
> > > > >> >> Best,
> > > > >> >> Yang
> > > > >> >>
> > > > >> >> Peter Huang <[hidden email]> 于2019年12月25日周三
> > 上午4:03写道:
> > > > >> >>
> > > > >> >>> Hi Jingjing,
> > > > >> >>>
> > > > >> >>> The improvement proposed is a deployment option for CLI. For
> SQL
> > > > based
> > > > >> >>> Flink application, It is more convenient to use the existing
> > model
> > > > in
> > > > >> >>> SqlClient in which
> > > > >> >>> the job graph is generated within SqlClient. After adding the
> > > > delayed
> > > > >> job
> > > > >> >>> graph generation, I think there is no change is needed for
> your
> > > > side.
> > > > >> >>>
> > > > >> >>>
> > > > >> >>> Best Regards
> > > > >> >>> Peter Huang
> > > > >> >>>
> > > > >> >>>
> > > > >> >>> On Wed, Dec 18, 2019 at 6:01 AM jingjing bai <
> > > > >> [hidden email]>
> > > > >> >>> wrote:
> > > > >> >>>
> > > > >> >>>> hi peter:
> > > > >> >>>>    we had extension SqlClent to support sql job submit in web
> > > base
> > > > on
> > > > >> >>>> flink 1.9.   we support submit to yarn on per job mode too.
> > > > >> >>>>    in this case, the job graph generated  on client side .  I
> > > think
> > > > >> >>> this
> > > > >> >>>> discuss Mainly to improve api programme.  but in my case ,
> > there
> > > is
> > > > >> no
> > > > >> >>>> jar to upload but only a sql string .
> > > > >> >>>>    do u had more suggestion to improve for sql mode or it is
> > > only a
> > > > >> >>>> switch for api programme?
> > > > >> >>>>
> > > > >> >>>>
> > > > >> >>>> best
> > > > >> >>>> bai jj
> > > > >> >>>>
> > > > >> >>>>
> > > > >> >>>> Yang Wang <[hidden email]> 于2019年12月18日周三 下午7:21写道:
> > > > >> >>>>
> > > > >> >>>>> I just want to revive this discussion.
> > > > >> >>>>>
> > > > >> >>>>> Recently, i am thinking about how to natively run flink
> > per-job
> > > > >> >>> cluster on
> > > > >> >>>>> Kubernetes.
> > > > >> >>>>> The per-job mode on Kubernetes is very different from on
> Yarn.
> > > And
> > > > >> we
> > > > >> >>> will
> > > > >> >>>>> have
> > > > >> >>>>> the same deployment requirements to the client and entry
> > point.
> > > > >> >>>>>
> > > > >> >>>>> 1. Flink client not always need a local jar to start a Flink
> > > > per-job
> > > > >> >>>>> cluster. We could
> > > > >> >>>>> support multiple schemas. For example,
> file:///path/of/my.jar
> > > > means
> > > > >> a
> > > > >> >>> jar
> > > > >> >>>>> located
> > > > >> >>>>> at client side, hdfs://myhdfs/user/myname/flink/my.jar
> means a
> > > jar
> > > > >> >>> located
> > > > >> >>>>> at
> > > > >> >>>>> remote hdfs, local:///path/in/image/my.jar means a jar
> located
> > > at
> > > > >> >>>>> jobmanager side.
> > > > >> >>>>>
> > > > >> >>>>> 2. Support running user program on master side. This also
> > means
> > > > the
> > > > >> >>> entry
> > > > >> >>>>> point
> > > > >> >>>>> will generate the job graph on master side. We could use the
> > > > >> >>>>> ClasspathJobGraphRetriever
> > > > >> >>>>> or start a local Flink client to achieve this purpose.
> > > > >> >>>>>
> > > > >> >>>>>
> > > > >> >>>>> cc tison, Aljoscha & Kostas Do you think this is the right
> > > > >> direction we
> > > > >> >>>>> need to work?
> > > > >> >>>>>
> > > > >> >>>>> tison <[hidden email]> 于2019年12月12日周四 下午4:48写道:
> > > > >> >>>>>
> > > > >> >>>>>> A quick idea is that we separate the deployment from user
> > > program
> > > > >> >>> that
> > > > >> >>>>> it
> > > > >> >>>>>> has always been done
> > > > >> >>>>>> outside the program. On user program executed there is
> > always a
> > > > >> >>>>>> ClusterClient that communicates with
> > > > >> >>>>>> an existing cluster, remote or local. It will be another
> > thread
> > > > so
> > > > >> >>> just
> > > > >> >>>>> for
> > > > >> >>>>>> your information.
> > > > >> >>>>>>
> > > > >> >>>>>> Best,
> > > > >> >>>>>> tison.
> > > > >> >>>>>>
> > > > >> >>>>>>
> > > > >> >>>>>> tison <[hidden email]> 于2019年12月12日周四 下午4:40写道:
> > > > >> >>>>>>
> > > > >> >>>>>>> Hi Peter,
> > > > >> >>>>>>>
> > > > >> >>>>>>> Another concern I realized recently is that with current
> > > > Executors
> > > > >> >>>>>>> abstraction(FLIP-73)
> > > > >> >>>>>>> I'm afraid that user program is designed to ALWAYS run on
> > the
> > > > >> >>> client
> > > > >> >>>>>> side.
> > > > >> >>>>>>> Specifically,
> > > > >> >>>>>>> we deploy the job in executor when env.execute called.
> This
> > > > >> >>>>> abstraction
> > > > >> >>>>>>> possibly prevents
> > > > >> >>>>>>> Flink runs user program on the cluster side.
> > > > >> >>>>>>>
> > > > >> >>>>>>> For your proposal, in this case we already compiled the
> > > program
> > > > >> and
> > > > >> >>>>> run
> > > > >> >>>>>> on
> > > > >> >>>>>>> the client side,
> > > > >> >>>>>>> even we deploy a cluster and retrieve job graph from
> program
> > > > >> >>>>> metadata, it
> > > > >> >>>>>>> doesn't make
> > > > >> >>>>>>> many sense.
> > > > >> >>>>>>>
> > > > >> >>>>>>> cc Aljoscha & Kostas what do you think about this
> > constraint?
> > > > >> >>>>>>>
> > > > >> >>>>>>> Best,
> > > > >> >>>>>>> tison.
> > > > >> >>>>>>>
> > > > >> >>>>>>>
> > > > >> >>>>>>> Peter Huang <[hidden email]> 于2019年12月10日周二
> > > > >> 下午12:45写道:
> > > > >> >>>>>>>
> > > > >> >>>>>>>> Hi Tison,
> > > > >> >>>>>>>>
> > > > >> >>>>>>>> Yes, you are right. I think I made the wrong argument in
> > the
> > > > doc.
> > > > >> >>>>>>>> Basically, the packaging jar problem is only for platform
> > > > users.
> > > > >> >>> In
> > > > >> >>>>> our
> > > > >> >>>>>>>> internal deploy service,
> > > > >> >>>>>>>> we further optimized the deployment latency by letting
> > users
> > > to
> > > > >> >>>>>> packaging
> > > > >> >>>>>>>> flink-runtime together with the uber jar, so that we
> don't
> > > need
> > > > >> to
> > > > >> >>>>>>>> consider
> > > > >> >>>>>>>> multiple flink version
> > > > >> >>>>>>>> support for now. In the session client mode, as Flink
> libs
> > > will
> > > > >> be
> > > > >> >>>>>> shipped
> > > > >> >>>>>>>> anyway as local resources of yarn. Users actually don't
> > need
> > > to
> > > > >> >>>>> package
> > > > >> >>>>>>>> those libs into job jar.
> > > > >> >>>>>>>>
> > > > >> >>>>>>>>
> > > > >> >>>>>>>>
> > > > >> >>>>>>>> Best Regards
> > > > >> >>>>>>>> Peter Huang
> > > > >> >>>>>>>>
> > > > >> >>>>>>>> On Mon, Dec 9, 2019 at 8:35 PM tison <
> [hidden email]
> > >
> > > > >> >>> wrote:
> > > > >> >>>>>>>>
> > > > >> >>>>>>>>>> 3. What do you mean about the package? Do users need to
> > > > >> >>> compile
> > > > >> >>>>>> their
> > > > >> >>>>>>>>> jars
> > > > >> >>>>>>>>> inlcuding flink-clients, flink-optimizer, flink-table
> > codes?
> > > > >> >>>>>>>>>
> > > > >> >>>>>>>>> The answer should be no because they exist in system
> > > > classpath.
> > > > >> >>>>>>>>>
> > > > >> >>>>>>>>> Best,
> > > > >> >>>>>>>>> tison.
> > > > >> >>>>>>>>>
> > > > >> >>>>>>>>>
> > > > >> >>>>>>>>> Yang Wang <[hidden email]> 于2019年12月10日周二
> > 下午12:18写道:
> > > > >> >>>>>>>>>
> > > > >> >>>>>>>>>> Hi Peter,
> > > > >> >>>>>>>>>>
> > > > >> >>>>>>>>>> Thanks a lot for starting this discussion. I think this
> > is
> > > a
> > > > >> >>> very
> > > > >> >>>>>>>> useful
> > > > >> >>>>>>>>>> feature.
> > > > >> >>>>>>>>>>
> > > > >> >>>>>>>>>> Not only for Yarn, i am focused on flink on Kubernetes
> > > > >> >>>>> integration
> > > > >> >>>>>> and
> > > > >> >>>>>>>>> come
> > > > >> >>>>>>>>>> across the same
> > > > >> >>>>>>>>>> problem. I do not want the job graph generated on
> client
> > > > side.
> > > > >> >>>>>>>> Instead,
> > > > >> >>>>>>>>> the
> > > > >> >>>>>>>>>> user jars are built in
> > > > >> >>>>>>>>>> a user-defined image. When the job manager launched, we
> > > just
> > > > >> >>>>> need to
> > > > >> >>>>>>>>>> generate the job graph
> > > > >> >>>>>>>>>> based on local user jars.
> > > > >> >>>>>>>>>>
> > > > >> >>>>>>>>>> I have some small suggestion about this.
> > > > >> >>>>>>>>>>
> > > > >> >>>>>>>>>> 1. `ProgramJobGraphRetriever` is very similar to
> > > > >> >>>>>>>>>> `ClasspathJobGraphRetriever`, the differences
> > > > >> >>>>>>>>>> are the former needs `ProgramMetadata` and the latter
> > needs
> > > > >> >>> some
> > > > >> >>>>>>>>> arguments.
> > > > >> >>>>>>>>>> Is it possible to
> > > > >> >>>>>>>>>> have an unified `JobGraphRetriever` to support both?
> > > > >> >>>>>>>>>> 2. Is it possible to not use a local user jar to start
> a
> > > > >> >>> per-job
> > > > >> >>>>>>>> cluster?
> > > > >> >>>>>>>>>> In your case, the user jars has
> > > > >> >>>>>>>>>> existed on hdfs already and we do need to download the
> > jars
> > > > to
> > > > >> >>>>>>>> deployer
> > > > >> >>>>>>>>>> service. Currently, we
> > > > >> >>>>>>>>>> always need a local user jar to start a flink cluster.
> It
> > > is
> > > > >> >>> be
> > > > >> >>>>>> great
> > > > >> >>>>>>>> if
> > > > >> >>>>>>>>> we
> > > > >> >>>>>>>>>> could support remote user jars.
> > > > >> >>>>>>>>>>>> In the implementation, we assume users package
> > > > >> >>> flink-clients,
> > > > >> >>>>>>>>>> flink-optimizer, flink-table together within the job
> jar.
> > > > >> >>>>> Otherwise,
> > > > >> >>>>>>>> the
> > > > >> >>>>>>>>>> job graph generation within JobClusterEntryPoint will
> > fail.
> > > > >> >>>>>>>>>> 3. What do you mean about the package? Do users need to
> > > > >> >>> compile
> > > > >> >>>>>> their
> > > > >> >>>>>>>>> jars
> > > > >> >>>>>>>>>> inlcuding flink-clients, flink-optimizer, flink-table
> > > codes?
> > > > >> >>>>>>>>>>
> > > > >> >>>>>>>>>>
> > > > >> >>>>>>>>>>
> > > > >> >>>>>>>>>> Best,
> > > > >> >>>>>>>>>> Yang
> > > > >> >>>>>>>>>>
> > > > >> >>>>>>>>>> Peter Huang <[hidden email]>
> 于2019年12月10日周二
> > > > >> >>>>> 上午2:37写道:
> > > > >> >>>>>>>>>>
> > > > >> >>>>>>>>>>> Dear All,
> > > > >> >>>>>>>>>>>
> > > > >> >>>>>>>>>>> Recently, the Flink community starts to improve the
> yarn
> > > > >> >>>>> cluster
> > > > >> >>>>>>>>>> descriptor
> > > > >> >>>>>>>>>>> to make job jar and config files configurable from
> CLI.
> > It
> > > > >> >>>>>> improves
> > > > >> >>>>>>>> the
> > > > >> >>>>>>>>>>> flexibility of  Flink deployment Yarn Per Job Mode.
> For
> > > > >> >>>>> platform
> > > > >> >>>>>>>> users
> > > > >> >>>>>>>>>> who
> > > > >> >>>>>>>>>>> manage tens of hundreds of streaming pipelines for the
> > > whole
> > > > >> >>>>> org
> > > > >> >>>>>> or
> > > > >> >>>>>>>>>>> company, we found the job graph generation in
> > client-side
> > > is
> > > > >> >>>>>> another
> > > > >> >>>>>>>>>>> pinpoint. Thus, we want to propose a configurable
> > feature
> > > > >> >>> for
> > > > >> >>>>>>>>>>> FlinkYarnSessionCli. The feature can allow users to
> > choose
> > > > >> >>> the
> > > > >> >>>>> job
> > > > >> >>>>>>>>> graph
> > > > >> >>>>>>>>>>> generation in Flink ClusterEntryPoint so that the job
> > jar
> > > > >> >>>>> doesn't
> > > > >> >>>>>>>> need
> > > > >> >>>>>>>>> to
> > > > >> >>>>>>>>>>> be locally for the job graph generation. The proposal
> is
> > > > >> >>>>> organized
> > > > >> >>>>>>>> as a
> > > > >> >>>>>>>>>>> FLIP
> > > > >> >>>>>>>>>>>
> > > > >> >>>>>>>>>>>
> > > > >> >>>>>>>>>>
> > > > >> >>>>>>>>>
> > > > >> >>>>>>>>
> > > > >> >>>>>>
> > > > >> >>>>>
> > > > >> >>>
> > > > >>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-85+Delayed+JobGraph+Generation
> > > > >> >>>>>>>>>>> .
> > > > >> >>>>>>>>>>>
> > > > >> >>>>>>>>>>> Any questions and suggestions are welcomed. Thank you
> in
> > > > >> >>>>> advance.
> > > > >> >>>>>>>>>>>
> > > > >> >>>>>>>>>>>
> > > > >> >>>>>>>>>>> Best Regards
> > > > >> >>>>>>>>>>> Peter Huang
> > > > >> >>>>>>>>>>>
> > > > >> >>>>>>>>>>
> > > > >> >>>>>>>>>
> > > > >> >>>>>>>>
> > > > >> >>>>>>>
> > > > >> >>>>>>
> > > > >> >>>>>
> > > > >> >>>>
> > > > >> >>>
> > > > >> >>
> > > > >>
> > > > >>
> > > >
> > >
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-85: Delayed Job Graph Generation

tison
not always, Yang Wang is also not yet a committer but he can join the
channel. I cannot find the id by clicking “Add new member in channel” so
come to you and ask for try out the link. Possibly I will find other ways
but the original purpose is that the slack channel is a public area we
discuss about developing...
Best,
tison.


Peter Huang <[hidden email]> 于2020年1月9日周四 上午2:44写道:

> Hi Tison,
>
> I am not the committer of Flink yet. I think I can't join it also.
>
>
> Best Regards
> Peter Huang
>
> On Wed, Jan 8, 2020 at 9:39 AM tison <[hidden email]> wrote:
>
> > Hi Peter,
> >
> > Could you try out this link?
> https://the-asf.slack.com/messages/CNA3ADZPH
> >
> > Best,
> > tison.
> >
> >
> > Peter Huang <[hidden email]> 于2020年1月9日周四 上午1:22写道:
> >
> > > Hi Tison,
> > >
> > > I can't join the group with shared link. Would you please add me into
> the
> > > group? My slack account is huangzhenqiu0825.
> > > Thank you in advance.
> > >
> > >
> > > Best Regards
> > > Peter Huang
> > >
> > > On Wed, Jan 8, 2020 at 12:02 AM tison <[hidden email]> wrote:
> > >
> > > > Hi Peter,
> > > >
> > > > As described above, this effort should get attention from people
> > > developing
> > > > FLIP-73 a.k.a. Executor abstractions. I recommend you to join the
> > public
> > > > slack channel[1] for Flink Client API Enhancement and you can try to
> > > share
> > > > you detailed thoughts there. It possibly gets more concrete
> attentions.
> > > >
> > > > Best,
> > > > tison.
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://slack.com/share/IS21SJ75H/Rk8HhUly9FuEHb7oGwBZ33uL/enQtODg2MDYwNjE5MTg3LTA2MjIzNDc1M2ZjZDVlMjdlZjk1M2RkYmJhNjAwMTk2ZDZkODQ4NmY5YmI4OGRhNWJkYTViMTM1NzlmMzc4OWM
> > > >
> > > >
> > > > Peter Huang <[hidden email]> 于2020年1月7日周二 上午5:09写道:
> > > >
> > > > > Dear All,
> > > > >
> > > > > Happy new year! According to existing feedback from the community,
> we
> > > > > revised the doc with the consideration of session cluster support,
> > and
> > > > > concrete interface changes needed and execution plan. Please take
> one
> > > > more
> > > > > round of review at your most convenient time.
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1aAwVjdZByA-0CHbgv16Me-vjaaDMCfhX7TzVVTuifYM/edit#
> > > > >
> > > > >
> > > > > Best Regards
> > > > > Peter Huang
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Thu, Jan 2, 2020 at 11:29 AM Peter Huang <
> > > [hidden email]>
> > > > > wrote:
> > > > >
> > > > > > Hi Dian,
> > > > > > Thanks for giving us valuable feedbacks.
> > > > > >
> > > > > > 1) It's better to have a whole design for this feature
> > > > > > For the suggestion of enabling the cluster mode also session
> > > cluster, I
> > > > > > think Flink already supported it. WebSubmissionExtension already
> > > allows
> > > > > > users to start a job with the specified jar by using web UI.
> > > > > > But we need to enable the feature from CLI for both local jar,
> > remote
> > > > > jar.
> > > > > > I will align with Yang Wang first about the details and update
> the
> > > > design
> > > > > > doc.
> > > > > >
> > > > > > 2) It's better to consider the convenience for users, such as
> > > debugging
> > > > > >
> > > > > > I am wondering whether we can store the exception in jobgragh
> > > > > > generation in application master. As no streaming graph can be
> > > > scheduled
> > > > > in
> > > > > > this case, there will be no more TM will be requested from
> FlinkRM.
> > > > > > If the AM is still running, users can still query it from CLI. As
> > it
> > > > > > requires more change, we can get some feedback from <
> > > > [hidden email]
> > > > > >
> > > > > > and @[hidden email] <[hidden email]>.
> > > > > >
> > > > > > 3) It's better to consider the impact to the stability of the
> > cluster
> > > > > >
> > > > > > I agree with Yang Wang's opinion.
> > > > > >
> > > > > >
> > > > > >
> > > > > > Best Regards
> > > > > > Peter Huang
> > > > > >
> > > > > >
> > > > > > On Sun, Dec 29, 2019 at 9:44 PM Dian Fu <[hidden email]>
> > > wrote:
> > > > > >
> > > > > >> Hi all,
> > > > > >>
> > > > > >> Sorry to jump into this discussion. Thanks everyone for the
> > > > discussion.
> > > > > >> I'm very interested in this topic although I'm not an expert in
> > this
> > > > > part.
> > > > > >> So I'm glad to share my thoughts as following:
> > > > > >>
> > > > > >> 1) It's better to have a whole design for this feature
> > > > > >> As we know, there are two deployment modes: per-job mode and
> > session
> > > > > >> mode. I'm wondering which mode really needs this feature. As the
> > > > design
> > > > > doc
> > > > > >> mentioned, per-job mode is more used for streaming jobs and
> > session
> > > > > mode is
> > > > > >> usually used for batch jobs(Of course, the job types and the
> > > > deployment
> > > > > >> modes are orthogonal). Usually streaming job is only needed to
> be
> > > > > submitted
> > > > > >> once and it will run for days or weeks, while batch jobs will be
> > > > > submitted
> > > > > >> more frequently compared with streaming jobs. This means that
> > maybe
> > > > > session
> > > > > >> mode also needs this feature. However, if we support this
> feature
> > in
> > > > > >> session mode, the application master will become the new
> > centralized
> > > > > >> service(which should be solved). So in this case, it's better to
> > > have
> > > > a
> > > > > >> complete design for both per-job mode and session mode.
> > Furthermore,
> > > > > even
> > > > > >> if we can do it phase by phase, we need to have a whole picture
> of
> > > how
> > > > > it
> > > > > >> works in both per-job mode and session mode.
> > > > > >>
> > > > > >> 2) It's better to consider the convenience for users, such as
> > > > debugging
> > > > > >> After we finish this feature, the job graph will be compiled in
> > the
> > > > > >> application master, which means that users cannot easily get the
> > > > > exception
> > > > > >> message synchorousely in the job client if there are problems
> > during
> > > > the
> > > > > >> job graph compiling (especially for platform users), such as the
> > > > > resource
> > > > > >> path is incorrect, the user program itself has some problems,
> etc.
> > > > What
> > > > > I'm
> > > > > >> thinking is that maybe we should throw the exceptions as early
> as
> > > > > possible
> > > > > >> (during job submission stage).
> > > > > >>
> > > > > >> 3) It's better to consider the impact to the stability of the
> > > cluster
> > > > > >> If we perform the compiling in the application master, we should
> > > > > consider
> > > > > >> the impact of the compiling errors. Although YARN could resume
> the
> > > > > >> application master in case of failures, but in some case the
> > > compiling
> > > > > >> failure may be a waste of cluster resource and may impact the
> > > > stability
> > > > > the
> > > > > >> cluster and the other jobs in the cluster, such as the resource
> > path
> > > > is
> > > > > >> incorrect, the user program itself has some problems(in this
> case,
> > > job
> > > > > >> failover cannot solve this kind of problems) etc. In the current
> > > > > >> implemention, the compiling errors are handled in the client
> side
> > > and
> > > > > there
> > > > > >> is no impact to the cluster at all.
> > > > > >>
> > > > > >> Regarding to 1), it's clearly pointed in the design doc that
> only
> > > > > per-job
> > > > > >> mode will be supported. However, I think it's better to also
> > > consider
> > > > > the
> > > > > >> session mode in the design doc.
> > > > > >> Regarding to 2) and 3), I have not seen related sections in the
> > > design
> > > > > >> doc. It will be good if we can cover them in the design doc.
> > > > > >>
> > > > > >> Feel free to correct me If there is anything I misunderstand.
> > > > > >>
> > > > > >> Regards,
> > > > > >> Dian
> > > > > >>
> > > > > >>
> > > > > >> > 在 2019年12月27日,上午3:13,Peter Huang <[hidden email]>
> > 写道:
> > > > > >> >
> > > > > >> > Hi Yang,
> > > > > >> >
> > > > > >> > I can't agree more. The effort definitely needs to align with
> > the
> > > > > final
> > > > > >> > goal of FLIP-73.
> > > > > >> > I am thinking about whether we can achieve the goal with two
> > > phases.
> > > > > >> >
> > > > > >> > 1) Phase I
> > > > > >> > As the CLiFrontend will not be depreciated soon. We can still
> > use
> > > > the
> > > > > >> > deployMode flag there,
> > > > > >> > pass the program info through Flink configuration,  use the
> > > > > >> > ClassPathJobGraphRetriever
> > > > > >> > to generate the job graph in ClusterEntrypoints of yarn and
> > > > > Kubernetes.
> > > > > >> >
> > > > > >> > 2) Phase II
> > > > > >> > In  AbstractJobClusterExecutor, the job graph is generated in
> > the
> > > > > >> execute
> > > > > >> > function. We can still
> > > > > >> > use the deployMode in it. With deployMode = cluster, the
> execute
> > > > > >> function
> > > > > >> > only starts the cluster.
> > > > > >> >
> > > > > >> > When {Yarn/Kuberneates}PerJobClusterEntrypoint starts, It will
> > > start
> > > > > the
> > > > > >> > dispatch first, then we can use
> > > > > >> > a ClusterEnvironment similar to ContextEnvironment to submit
> the
> > > job
> > > > > >> with
> > > > > >> > jobName the local
> > > > > >> > dispatcher. For the details, we need more investigation. Let's
> > > wait
> > > > > >> > for @Aljoscha
> > > > > >> > Krettek <[hidden email]> @Till Rohrmann <
> > > [hidden email]
> > > > >'s
> > > > > >> > feedback after the holiday season.
> > > > > >> >
> > > > > >> > Thank you in advance. Merry Chrismas and Happy New Year!!!
> > > > > >> >
> > > > > >> >
> > > > > >> > Best Regards
> > > > > >> > Peter Huang
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> > On Wed, Dec 25, 2019 at 1:08 AM Yang Wang <
> > [hidden email]>
> > > > > >> wrote:
> > > > > >> >
> > > > > >> >> Hi Peter,
> > > > > >> >>
> > > > > >> >> I think we need to reconsider tison's suggestion seriously.
> > After
> > > > > >> FLIP-73,
> > > > > >> >> the deployJobCluster has
> > > > > >> >> beenmoved into `JobClusterExecutor#execute`. It should not be
> > > > > perceived
> > > > > >> >> for `CliFrontend`. That
> > > > > >> >> means the user program will *ALWAYS* be executed on client
> > side.
> > > > This
> > > > > >> is
> > > > > >> >> the by design behavior.
> > > > > >> >> So, we could not just add `if(client mode) .. else if(cluster
> > > mode)
> > > > > >> ...`
> > > > > >> >> codes in `CliFrontend` to bypass
> > > > > >> >> the executor. We need to find a clean way to decouple
> executing
> > > > user
> > > > > >> >> program and deploying per-job
> > > > > >> >> cluster. Based on this, we could support to execute user
> > program
> > > on
> > > > > >> client
> > > > > >> >> or master side.
> > > > > >> >>
> > > > > >> >> Maybe Aljoscha and Jeff could give some good suggestions.
> > > > > >> >>
> > > > > >> >>
> > > > > >> >>
> > > > > >> >> Best,
> > > > > >> >> Yang
> > > > > >> >>
> > > > > >> >> Peter Huang <[hidden email]> 于2019年12月25日周三
> > > 上午4:03写道:
> > > > > >> >>
> > > > > >> >>> Hi Jingjing,
> > > > > >> >>>
> > > > > >> >>> The improvement proposed is a deployment option for CLI. For
> > SQL
> > > > > based
> > > > > >> >>> Flink application, It is more convenient to use the existing
> > > model
> > > > > in
> > > > > >> >>> SqlClient in which
> > > > > >> >>> the job graph is generated within SqlClient. After adding
> the
> > > > > delayed
> > > > > >> job
> > > > > >> >>> graph generation, I think there is no change is needed for
> > your
> > > > > side.
> > > > > >> >>>
> > > > > >> >>>
> > > > > >> >>> Best Regards
> > > > > >> >>> Peter Huang
> > > > > >> >>>
> > > > > >> >>>
> > > > > >> >>> On Wed, Dec 18, 2019 at 6:01 AM jingjing bai <
> > > > > >> [hidden email]>
> > > > > >> >>> wrote:
> > > > > >> >>>
> > > > > >> >>>> hi peter:
> > > > > >> >>>>    we had extension SqlClent to support sql job submit in
> web
> > > > base
> > > > > on
> > > > > >> >>>> flink 1.9.   we support submit to yarn on per job mode too.
> > > > > >> >>>>    in this case, the job graph generated  on client side
> .  I
> > > > think
> > > > > >> >>> this
> > > > > >> >>>> discuss Mainly to improve api programme.  but in my case ,
> > > there
> > > > is
> > > > > >> no
> > > > > >> >>>> jar to upload but only a sql string .
> > > > > >> >>>>    do u had more suggestion to improve for sql mode or it
> is
> > > > only a
> > > > > >> >>>> switch for api programme?
> > > > > >> >>>>
> > > > > >> >>>>
> > > > > >> >>>> best
> > > > > >> >>>> bai jj
> > > > > >> >>>>
> > > > > >> >>>>
> > > > > >> >>>> Yang Wang <[hidden email]> 于2019年12月18日周三 下午7:21写道:
> > > > > >> >>>>
> > > > > >> >>>>> I just want to revive this discussion.
> > > > > >> >>>>>
> > > > > >> >>>>> Recently, i am thinking about how to natively run flink
> > > per-job
> > > > > >> >>> cluster on
> > > > > >> >>>>> Kubernetes.
> > > > > >> >>>>> The per-job mode on Kubernetes is very different from on
> > Yarn.
> > > > And
> > > > > >> we
> > > > > >> >>> will
> > > > > >> >>>>> have
> > > > > >> >>>>> the same deployment requirements to the client and entry
> > > point.
> > > > > >> >>>>>
> > > > > >> >>>>> 1. Flink client not always need a local jar to start a
> Flink
> > > > > per-job
> > > > > >> >>>>> cluster. We could
> > > > > >> >>>>> support multiple schemas. For example,
> > file:///path/of/my.jar
> > > > > means
> > > > > >> a
> > > > > >> >>> jar
> > > > > >> >>>>> located
> > > > > >> >>>>> at client side, hdfs://myhdfs/user/myname/flink/my.jar
> > means a
> > > > jar
> > > > > >> >>> located
> > > > > >> >>>>> at
> > > > > >> >>>>> remote hdfs, local:///path/in/image/my.jar means a jar
> > located
> > > > at
> > > > > >> >>>>> jobmanager side.
> > > > > >> >>>>>
> > > > > >> >>>>> 2. Support running user program on master side. This also
> > > means
> > > > > the
> > > > > >> >>> entry
> > > > > >> >>>>> point
> > > > > >> >>>>> will generate the job graph on master side. We could use
> the
> > > > > >> >>>>> ClasspathJobGraphRetriever
> > > > > >> >>>>> or start a local Flink client to achieve this purpose.
> > > > > >> >>>>>
> > > > > >> >>>>>
> > > > > >> >>>>> cc tison, Aljoscha & Kostas Do you think this is the right
> > > > > >> direction we
> > > > > >> >>>>> need to work?
> > > > > >> >>>>>
> > > > > >> >>>>> tison <[hidden email]> 于2019年12月12日周四 下午4:48写道:
> > > > > >> >>>>>
> > > > > >> >>>>>> A quick idea is that we separate the deployment from user
> > > > program
> > > > > >> >>> that
> > > > > >> >>>>> it
> > > > > >> >>>>>> has always been done
> > > > > >> >>>>>> outside the program. On user program executed there is
> > > always a
> > > > > >> >>>>>> ClusterClient that communicates with
> > > > > >> >>>>>> an existing cluster, remote or local. It will be another
> > > thread
> > > > > so
> > > > > >> >>> just
> > > > > >> >>>>> for
> > > > > >> >>>>>> your information.
> > > > > >> >>>>>>
> > > > > >> >>>>>> Best,
> > > > > >> >>>>>> tison.
> > > > > >> >>>>>>
> > > > > >> >>>>>>
> > > > > >> >>>>>> tison <[hidden email]> 于2019年12月12日周四 下午4:40写道:
> > > > > >> >>>>>>
> > > > > >> >>>>>>> Hi Peter,
> > > > > >> >>>>>>>
> > > > > >> >>>>>>> Another concern I realized recently is that with current
> > > > > Executors
> > > > > >> >>>>>>> abstraction(FLIP-73)
> > > > > >> >>>>>>> I'm afraid that user program is designed to ALWAYS run
> on
> > > the
> > > > > >> >>> client
> > > > > >> >>>>>> side.
> > > > > >> >>>>>>> Specifically,
> > > > > >> >>>>>>> we deploy the job in executor when env.execute called.
> > This
> > > > > >> >>>>> abstraction
> > > > > >> >>>>>>> possibly prevents
> > > > > >> >>>>>>> Flink runs user program on the cluster side.
> > > > > >> >>>>>>>
> > > > > >> >>>>>>> For your proposal, in this case we already compiled the
> > > > program
> > > > > >> and
> > > > > >> >>>>> run
> > > > > >> >>>>>> on
> > > > > >> >>>>>>> the client side,
> > > > > >> >>>>>>> even we deploy a cluster and retrieve job graph from
> > program
> > > > > >> >>>>> metadata, it
> > > > > >> >>>>>>> doesn't make
> > > > > >> >>>>>>> many sense.
> > > > > >> >>>>>>>
> > > > > >> >>>>>>> cc Aljoscha & Kostas what do you think about this
> > > constraint?
> > > > > >> >>>>>>>
> > > > > >> >>>>>>> Best,
> > > > > >> >>>>>>> tison.
> > > > > >> >>>>>>>
> > > > > >> >>>>>>>
> > > > > >> >>>>>>> Peter Huang <[hidden email]> 于2019年12月10日周二
> > > > > >> 下午12:45写道:
> > > > > >> >>>>>>>
> > > > > >> >>>>>>>> Hi Tison,
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>>> Yes, you are right. I think I made the wrong argument
> in
> > > the
> > > > > doc.
> > > > > >> >>>>>>>> Basically, the packaging jar problem is only for
> platform
> > > > > users.
> > > > > >> >>> In
> > > > > >> >>>>> our
> > > > > >> >>>>>>>> internal deploy service,
> > > > > >> >>>>>>>> we further optimized the deployment latency by letting
> > > users
> > > > to
> > > > > >> >>>>>> packaging
> > > > > >> >>>>>>>> flink-runtime together with the uber jar, so that we
> > don't
> > > > need
> > > > > >> to
> > > > > >> >>>>>>>> consider
> > > > > >> >>>>>>>> multiple flink version
> > > > > >> >>>>>>>> support for now. In the session client mode, as Flink
> > libs
> > > > will
> > > > > >> be
> > > > > >> >>>>>> shipped
> > > > > >> >>>>>>>> anyway as local resources of yarn. Users actually don't
> > > need
> > > > to
> > > > > >> >>>>> package
> > > > > >> >>>>>>>> those libs into job jar.
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>>> Best Regards
> > > > > >> >>>>>>>> Peter Huang
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>>> On Mon, Dec 9, 2019 at 8:35 PM tison <
> > [hidden email]
> > > >
> > > > > >> >>> wrote:
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>>>>> 3. What do you mean about the package? Do users need
> to
> > > > > >> >>> compile
> > > > > >> >>>>>> their
> > > > > >> >>>>>>>>> jars
> > > > > >> >>>>>>>>> inlcuding flink-clients, flink-optimizer, flink-table
> > > codes?
> > > > > >> >>>>>>>>>
> > > > > >> >>>>>>>>> The answer should be no because they exist in system
> > > > > classpath.
> > > > > >> >>>>>>>>>
> > > > > >> >>>>>>>>> Best,
> > > > > >> >>>>>>>>> tison.
> > > > > >> >>>>>>>>>
> > > > > >> >>>>>>>>>
> > > > > >> >>>>>>>>> Yang Wang <[hidden email]> 于2019年12月10日周二
> > > 下午12:18写道:
> > > > > >> >>>>>>>>>
> > > > > >> >>>>>>>>>> Hi Peter,
> > > > > >> >>>>>>>>>>
> > > > > >> >>>>>>>>>> Thanks a lot for starting this discussion. I think
> this
> > > is
> > > > a
> > > > > >> >>> very
> > > > > >> >>>>>>>> useful
> > > > > >> >>>>>>>>>> feature.
> > > > > >> >>>>>>>>>>
> > > > > >> >>>>>>>>>> Not only for Yarn, i am focused on flink on
> Kubernetes
> > > > > >> >>>>> integration
> > > > > >> >>>>>> and
> > > > > >> >>>>>>>>> come
> > > > > >> >>>>>>>>>> across the same
> > > > > >> >>>>>>>>>> problem. I do not want the job graph generated on
> > client
> > > > > side.
> > > > > >> >>>>>>>> Instead,
> > > > > >> >>>>>>>>> the
> > > > > >> >>>>>>>>>> user jars are built in
> > > > > >> >>>>>>>>>> a user-defined image. When the job manager launched,
> we
> > > > just
> > > > > >> >>>>> need to
> > > > > >> >>>>>>>>>> generate the job graph
> > > > > >> >>>>>>>>>> based on local user jars.
> > > > > >> >>>>>>>>>>
> > > > > >> >>>>>>>>>> I have some small suggestion about this.
> > > > > >> >>>>>>>>>>
> > > > > >> >>>>>>>>>> 1. `ProgramJobGraphRetriever` is very similar to
> > > > > >> >>>>>>>>>> `ClasspathJobGraphRetriever`, the differences
> > > > > >> >>>>>>>>>> are the former needs `ProgramMetadata` and the latter
> > > needs
> > > > > >> >>> some
> > > > > >> >>>>>>>>> arguments.
> > > > > >> >>>>>>>>>> Is it possible to
> > > > > >> >>>>>>>>>> have an unified `JobGraphRetriever` to support both?
> > > > > >> >>>>>>>>>> 2. Is it possible to not use a local user jar to
> start
> > a
> > > > > >> >>> per-job
> > > > > >> >>>>>>>> cluster?
> > > > > >> >>>>>>>>>> In your case, the user jars has
> > > > > >> >>>>>>>>>> existed on hdfs already and we do need to download
> the
> > > jars
> > > > > to
> > > > > >> >>>>>>>> deployer
> > > > > >> >>>>>>>>>> service. Currently, we
> > > > > >> >>>>>>>>>> always need a local user jar to start a flink
> cluster.
> > It
> > > > is
> > > > > >> >>> be
> > > > > >> >>>>>> great
> > > > > >> >>>>>>>> if
> > > > > >> >>>>>>>>> we
> > > > > >> >>>>>>>>>> could support remote user jars.
> > > > > >> >>>>>>>>>>>> In the implementation, we assume users package
> > > > > >> >>> flink-clients,
> > > > > >> >>>>>>>>>> flink-optimizer, flink-table together within the job
> > jar.
> > > > > >> >>>>> Otherwise,
> > > > > >> >>>>>>>> the
> > > > > >> >>>>>>>>>> job graph generation within JobClusterEntryPoint will
> > > fail.
> > > > > >> >>>>>>>>>> 3. What do you mean about the package? Do users need
> to
> > > > > >> >>> compile
> > > > > >> >>>>>> their
> > > > > >> >>>>>>>>> jars
> > > > > >> >>>>>>>>>> inlcuding flink-clients, flink-optimizer, flink-table
> > > > codes?
> > > > > >> >>>>>>>>>>
> > > > > >> >>>>>>>>>>
> > > > > >> >>>>>>>>>>
> > > > > >> >>>>>>>>>> Best,
> > > > > >> >>>>>>>>>> Yang
> > > > > >> >>>>>>>>>>
> > > > > >> >>>>>>>>>> Peter Huang <[hidden email]>
> > 于2019年12月10日周二
> > > > > >> >>>>> 上午2:37写道:
> > > > > >> >>>>>>>>>>
> > > > > >> >>>>>>>>>>> Dear All,
> > > > > >> >>>>>>>>>>>
> > > > > >> >>>>>>>>>>> Recently, the Flink community starts to improve the
> > yarn
> > > > > >> >>>>> cluster
> > > > > >> >>>>>>>>>> descriptor
> > > > > >> >>>>>>>>>>> to make job jar and config files configurable from
> > CLI.
> > > It
> > > > > >> >>>>>> improves
> > > > > >> >>>>>>>> the
> > > > > >> >>>>>>>>>>> flexibility of  Flink deployment Yarn Per Job Mode.
> > For
> > > > > >> >>>>> platform
> > > > > >> >>>>>>>> users
> > > > > >> >>>>>>>>>> who
> > > > > >> >>>>>>>>>>> manage tens of hundreds of streaming pipelines for
> the
> > > > whole
> > > > > >> >>>>> org
> > > > > >> >>>>>> or
> > > > > >> >>>>>>>>>>> company, we found the job graph generation in
> > > client-side
> > > > is
> > > > > >> >>>>>> another
> > > > > >> >>>>>>>>>>> pinpoint. Thus, we want to propose a configurable
> > > feature
> > > > > >> >>> for
> > > > > >> >>>>>>>>>>> FlinkYarnSessionCli. The feature can allow users to
> > > choose
> > > > > >> >>> the
> > > > > >> >>>>> job
> > > > > >> >>>>>>>>> graph
> > > > > >> >>>>>>>>>>> generation in Flink ClusterEntryPoint so that the
> job
> > > jar
> > > > > >> >>>>> doesn't
> > > > > >> >>>>>>>> need
> > > > > >> >>>>>>>>> to
> > > > > >> >>>>>>>>>>> be locally for the job graph generation. The
> proposal
> > is
> > > > > >> >>>>> organized
> > > > > >> >>>>>>>> as a
> > > > > >> >>>>>>>>>>> FLIP
> > > > > >> >>>>>>>>>>>
> > > > > >> >>>>>>>>>>>
> > > > > >> >>>>>>>>>>
> > > > > >> >>>>>>>>>
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>
> > > > > >> >>>>>
> > > > > >> >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-85+Delayed+JobGraph+Generation
> > > > > >> >>>>>>>>>>> .
> > > > > >> >>>>>>>>>>>
> > > > > >> >>>>>>>>>>> Any questions and suggestions are welcomed. Thank
> you
> > in
> > > > > >> >>>>> advance.
> > > > > >> >>>>>>>>>>>
> > > > > >> >>>>>>>>>>>
> > > > > >> >>>>>>>>>>> Best Regards
> > > > > >> >>>>>>>>>>> Peter Huang
> > > > > >> >>>>>>>>>>>
> > > > > >> >>>>>>>>>>
> > > > > >> >>>>>>>>>
> > > > > >> >>>>>>>>
> > > > > >> >>>>>>>
> > > > > >> >>>>>>
> > > > > >> >>>>>
> > > > > >> >>>>
> > > > > >> >>>
> > > > > >> >>
> > > > > >>
> > > > > >>
> > > > >
> > > >
> > >
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-85: Delayed Job Graph Generation

Kostas Kloudas-4
Hi all,

I am writing here as the discussion on the Google Doc seems to be a
bit difficult to follow.

I think that in order to be able to make progress, it would be helpful
to focus on per-job mode for now.
The reason is that:
 1) making the (unique) JobSubmitHandler responsible for creating the jobgraphs,
  which includes downloading dependencies, is not an optimal solution
 2) even if we put the responsibility on the JobMaster, currently each
job has its own
  JobMaster but they all run on the same process, so we have again a
single entity.

Of course after this is done, and if we feel comfortable with the
solution, then we can go to the session mode.

A second comment has to do with fault-tolerance in the per-job,
cluster-deploy mode.
In the document, it is suggested that upon recovery, the JobMaster of
each job re-creates the JobGraph.
I am just wondering if it is better to create and store the jobGraph
upon submission and only fetch it
upon recovery so that we have a static jobGraph.

Finally, I have a question which is what happens with jobs that have
multiple execute calls?
The semantics seem to change compared to the current behaviour, right?

Cheers,
Kostas

On Wed, Jan 8, 2020 at 8:05 PM tison <[hidden email]> wrote:

>
> not always, Yang Wang is also not yet a committer but he can join the
> channel. I cannot find the id by clicking “Add new member in channel” so
> come to you and ask for try out the link. Possibly I will find other ways
> but the original purpose is that the slack channel is a public area we
> discuss about developing...
> Best,
> tison.
>
>
> Peter Huang <[hidden email]> 于2020年1月9日周四 上午2:44写道:
>
> > Hi Tison,
> >
> > I am not the committer of Flink yet. I think I can't join it also.
> >
> >
> > Best Regards
> > Peter Huang
> >
> > On Wed, Jan 8, 2020 at 9:39 AM tison <[hidden email]> wrote:
> >
> > > Hi Peter,
> > >
> > > Could you try out this link?
> > https://the-asf.slack.com/messages/CNA3ADZPH
> > >
> > > Best,
> > > tison.
> > >
> > >
> > > Peter Huang <[hidden email]> 于2020年1月9日周四 上午1:22写道:
> > >
> > > > Hi Tison,
> > > >
> > > > I can't join the group with shared link. Would you please add me into
> > the
> > > > group? My slack account is huangzhenqiu0825.
> > > > Thank you in advance.
> > > >
> > > >
> > > > Best Regards
> > > > Peter Huang
> > > >
> > > > On Wed, Jan 8, 2020 at 12:02 AM tison <[hidden email]> wrote:
> > > >
> > > > > Hi Peter,
> > > > >
> > > > > As described above, this effort should get attention from people
> > > > developing
> > > > > FLIP-73 a.k.a. Executor abstractions. I recommend you to join the
> > > public
> > > > > slack channel[1] for Flink Client API Enhancement and you can try to
> > > > share
> > > > > you detailed thoughts there. It possibly gets more concrete
> > attentions.
> > > > >
> > > > > Best,
> > > > > tison.
> > > > >
> > > > > [1]
> > > > >
> > > > >
> > > >
> > >
> > https://slack.com/share/IS21SJ75H/Rk8HhUly9FuEHb7oGwBZ33uL/enQtODg2MDYwNjE5MTg3LTA2MjIzNDc1M2ZjZDVlMjdlZjk1M2RkYmJhNjAwMTk2ZDZkODQ4NmY5YmI4OGRhNWJkYTViMTM1NzlmMzc4OWM
> > > > >
> > > > >
> > > > > Peter Huang <[hidden email]> 于2020年1月7日周二 上午5:09写道:
> > > > >
> > > > > > Dear All,
> > > > > >
> > > > > > Happy new year! According to existing feedback from the community,
> > we
> > > > > > revised the doc with the consideration of session cluster support,
> > > and
> > > > > > concrete interface changes needed and execution plan. Please take
> > one
> > > > > more
> > > > > > round of review at your most convenient time.
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > https://docs.google.com/document/d/1aAwVjdZByA-0CHbgv16Me-vjaaDMCfhX7TzVVTuifYM/edit#
> > > > > >
> > > > > >
> > > > > > Best Regards
> > > > > > Peter Huang
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Thu, Jan 2, 2020 at 11:29 AM Peter Huang <
> > > > [hidden email]>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Dian,
> > > > > > > Thanks for giving us valuable feedbacks.
> > > > > > >
> > > > > > > 1) It's better to have a whole design for this feature
> > > > > > > For the suggestion of enabling the cluster mode also session
> > > > cluster, I
> > > > > > > think Flink already supported it. WebSubmissionExtension already
> > > > allows
> > > > > > > users to start a job with the specified jar by using web UI.
> > > > > > > But we need to enable the feature from CLI for both local jar,
> > > remote
> > > > > > jar.
> > > > > > > I will align with Yang Wang first about the details and update
> > the
> > > > > design
> > > > > > > doc.
> > > > > > >
> > > > > > > 2) It's better to consider the convenience for users, such as
> > > > debugging
> > > > > > >
> > > > > > > I am wondering whether we can store the exception in jobgragh
> > > > > > > generation in application master. As no streaming graph can be
> > > > > scheduled
> > > > > > in
> > > > > > > this case, there will be no more TM will be requested from
> > FlinkRM.
> > > > > > > If the AM is still running, users can still query it from CLI. As
> > > it
> > > > > > > requires more change, we can get some feedback from <
> > > > > [hidden email]
> > > > > > >
> > > > > > > and @[hidden email] <[hidden email]>.
> > > > > > >
> > > > > > > 3) It's better to consider the impact to the stability of the
> > > cluster
> > > > > > >
> > > > > > > I agree with Yang Wang's opinion.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Best Regards
> > > > > > > Peter Huang
> > > > > > >
> > > > > > >
> > > > > > > On Sun, Dec 29, 2019 at 9:44 PM Dian Fu <[hidden email]>
> > > > wrote:
> > > > > > >
> > > > > > >> Hi all,
> > > > > > >>
> > > > > > >> Sorry to jump into this discussion. Thanks everyone for the
> > > > > discussion.
> > > > > > >> I'm very interested in this topic although I'm not an expert in
> > > this
> > > > > > part.
> > > > > > >> So I'm glad to share my thoughts as following:
> > > > > > >>
> > > > > > >> 1) It's better to have a whole design for this feature
> > > > > > >> As we know, there are two deployment modes: per-job mode and
> > > session
> > > > > > >> mode. I'm wondering which mode really needs this feature. As the
> > > > > design
> > > > > > doc
> > > > > > >> mentioned, per-job mode is more used for streaming jobs and
> > > session
> > > > > > mode is
> > > > > > >> usually used for batch jobs(Of course, the job types and the
> > > > > deployment
> > > > > > >> modes are orthogonal). Usually streaming job is only needed to
> > be
> > > > > > submitted
> > > > > > >> once and it will run for days or weeks, while batch jobs will be
> > > > > > submitted
> > > > > > >> more frequently compared with streaming jobs. This means that
> > > maybe
> > > > > > session
> > > > > > >> mode also needs this feature. However, if we support this
> > feature
> > > in
> > > > > > >> session mode, the application master will become the new
> > > centralized
> > > > > > >> service(which should be solved). So in this case, it's better to
> > > > have
> > > > > a
> > > > > > >> complete design for both per-job mode and session mode.
> > > Furthermore,
> > > > > > even
> > > > > > >> if we can do it phase by phase, we need to have a whole picture
> > of
> > > > how
> > > > > > it
> > > > > > >> works in both per-job mode and session mode.
> > > > > > >>
> > > > > > >> 2) It's better to consider the convenience for users, such as
> > > > > debugging
> > > > > > >> After we finish this feature, the job graph will be compiled in
> > > the
> > > > > > >> application master, which means that users cannot easily get the
> > > > > > exception
> > > > > > >> message synchorousely in the job client if there are problems
> > > during
> > > > > the
> > > > > > >> job graph compiling (especially for platform users), such as the
> > > > > > resource
> > > > > > >> path is incorrect, the user program itself has some problems,
> > etc.
> > > > > What
> > > > > > I'm
> > > > > > >> thinking is that maybe we should throw the exceptions as early
> > as
> > > > > > possible
> > > > > > >> (during job submission stage).
> > > > > > >>
> > > > > > >> 3) It's better to consider the impact to the stability of the
> > > > cluster
> > > > > > >> If we perform the compiling in the application master, we should
> > > > > > consider
> > > > > > >> the impact of the compiling errors. Although YARN could resume
> > the
> > > > > > >> application master in case of failures, but in some case the
> > > > compiling
> > > > > > >> failure may be a waste of cluster resource and may impact the
> > > > > stability
> > > > > > the
> > > > > > >> cluster and the other jobs in the cluster, such as the resource
> > > path
> > > > > is
> > > > > > >> incorrect, the user program itself has some problems(in this
> > case,
> > > > job
> > > > > > >> failover cannot solve this kind of problems) etc. In the current
> > > > > > >> implemention, the compiling errors are handled in the client
> > side
> > > > and
> > > > > > there
> > > > > > >> is no impact to the cluster at all.
> > > > > > >>
> > > > > > >> Regarding to 1), it's clearly pointed in the design doc that
> > only
> > > > > > per-job
> > > > > > >> mode will be supported. However, I think it's better to also
> > > > consider
> > > > > > the
> > > > > > >> session mode in the design doc.
> > > > > > >> Regarding to 2) and 3), I have not seen related sections in the
> > > > design
> > > > > > >> doc. It will be good if we can cover them in the design doc.
> > > > > > >>
> > > > > > >> Feel free to correct me If there is anything I misunderstand.
> > > > > > >>
> > > > > > >> Regards,
> > > > > > >> Dian
> > > > > > >>
> > > > > > >>
> > > > > > >> > 在 2019年12月27日,上午3:13,Peter Huang <[hidden email]>
> > > 写道:
> > > > > > >> >
> > > > > > >> > Hi Yang,
> > > > > > >> >
> > > > > > >> > I can't agree more. The effort definitely needs to align with
> > > the
> > > > > > final
> > > > > > >> > goal of FLIP-73.
> > > > > > >> > I am thinking about whether we can achieve the goal with two
> > > > phases.
> > > > > > >> >
> > > > > > >> > 1) Phase I
> > > > > > >> > As the CLiFrontend will not be depreciated soon. We can still
> > > use
> > > > > the
> > > > > > >> > deployMode flag there,
> > > > > > >> > pass the program info through Flink configuration,  use the
> > > > > > >> > ClassPathJobGraphRetriever
> > > > > > >> > to generate the job graph in ClusterEntrypoints of yarn and
> > > > > > Kubernetes.
> > > > > > >> >
> > > > > > >> > 2) Phase II
> > > > > > >> > In  AbstractJobClusterExecutor, the job graph is generated in
> > > the
> > > > > > >> execute
> > > > > > >> > function. We can still
> > > > > > >> > use the deployMode in it. With deployMode = cluster, the
> > execute
> > > > > > >> function
> > > > > > >> > only starts the cluster.
> > > > > > >> >
> > > > > > >> > When {Yarn/Kuberneates}PerJobClusterEntrypoint starts, It will
> > > > start
> > > > > > the
> > > > > > >> > dispatch first, then we can use
> > > > > > >> > a ClusterEnvironment similar to ContextEnvironment to submit
> > the
> > > > job
> > > > > > >> with
> > > > > > >> > jobName the local
> > > > > > >> > dispatcher. For the details, we need more investigation. Let's
> > > > wait
> > > > > > >> > for @Aljoscha
> > > > > > >> > Krettek <[hidden email]> @Till Rohrmann <
> > > > [hidden email]
> > > > > >'s
> > > > > > >> > feedback after the holiday season.
> > > > > > >> >
> > > > > > >> > Thank you in advance. Merry Chrismas and Happy New Year!!!
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > Best Regards
> > > > > > >> > Peter Huang
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > On Wed, Dec 25, 2019 at 1:08 AM Yang Wang <
> > > [hidden email]>
> > > > > > >> wrote:
> > > > > > >> >
> > > > > > >> >> Hi Peter,
> > > > > > >> >>
> > > > > > >> >> I think we need to reconsider tison's suggestion seriously.
> > > After
> > > > > > >> FLIP-73,
> > > > > > >> >> the deployJobCluster has
> > > > > > >> >> beenmoved into `JobClusterExecutor#execute`. It should not be
> > > > > > perceived
> > > > > > >> >> for `CliFrontend`. That
> > > > > > >> >> means the user program will *ALWAYS* be executed on client
> > > side.
> > > > > This
> > > > > > >> is
> > > > > > >> >> the by design behavior.
> > > > > > >> >> So, we could not just add `if(client mode) .. else if(cluster
> > > > mode)
> > > > > > >> ...`
> > > > > > >> >> codes in `CliFrontend` to bypass
> > > > > > >> >> the executor. We need to find a clean way to decouple
> > executing
> > > > > user
> > > > > > >> >> program and deploying per-job
> > > > > > >> >> cluster. Based on this, we could support to execute user
> > > program
> > > > on
> > > > > > >> client
> > > > > > >> >> or master side.
> > > > > > >> >>
> > > > > > >> >> Maybe Aljoscha and Jeff could give some good suggestions.
> > > > > > >> >>
> > > > > > >> >>
> > > > > > >> >>
> > > > > > >> >> Best,
> > > > > > >> >> Yang
> > > > > > >> >>
> > > > > > >> >> Peter Huang <[hidden email]> 于2019年12月25日周三
> > > > 上午4:03写道:
> > > > > > >> >>
> > > > > > >> >>> Hi Jingjing,
> > > > > > >> >>>
> > > > > > >> >>> The improvement proposed is a deployment option for CLI. For
> > > SQL
> > > > > > based
> > > > > > >> >>> Flink application, It is more convenient to use the existing
> > > > model
> > > > > > in
> > > > > > >> >>> SqlClient in which
> > > > > > >> >>> the job graph is generated within SqlClient. After adding
> > the
> > > > > > delayed
> > > > > > >> job
> > > > > > >> >>> graph generation, I think there is no change is needed for
> > > your
> > > > > > side.
> > > > > > >> >>>
> > > > > > >> >>>
> > > > > > >> >>> Best Regards
> > > > > > >> >>> Peter Huang
> > > > > > >> >>>
> > > > > > >> >>>
> > > > > > >> >>> On Wed, Dec 18, 2019 at 6:01 AM jingjing bai <
> > > > > > >> [hidden email]>
> > > > > > >> >>> wrote:
> > > > > > >> >>>
> > > > > > >> >>>> hi peter:
> > > > > > >> >>>>    we had extension SqlClent to support sql job submit in
> > web
> > > > > base
> > > > > > on
> > > > > > >> >>>> flink 1.9.   we support submit to yarn on per job mode too.
> > > > > > >> >>>>    in this case, the job graph generated  on client side
> > .  I
> > > > > think
> > > > > > >> >>> this
> > > > > > >> >>>> discuss Mainly to improve api programme.  but in my case ,
> > > > there
> > > > > is
> > > > > > >> no
> > > > > > >> >>>> jar to upload but only a sql string .
> > > > > > >> >>>>    do u had more suggestion to improve for sql mode or it
> > is
> > > > > only a
> > > > > > >> >>>> switch for api programme?
> > > > > > >> >>>>
> > > > > > >> >>>>
> > > > > > >> >>>> best
> > > > > > >> >>>> bai jj
> > > > > > >> >>>>
> > > > > > >> >>>>
> > > > > > >> >>>> Yang Wang <[hidden email]> 于2019年12月18日周三 下午7:21写道:
> > > > > > >> >>>>
> > > > > > >> >>>>> I just want to revive this discussion.
> > > > > > >> >>>>>
> > > > > > >> >>>>> Recently, i am thinking about how to natively run flink
> > > > per-job
> > > > > > >> >>> cluster on
> > > > > > >> >>>>> Kubernetes.
> > > > > > >> >>>>> The per-job mode on Kubernetes is very different from on
> > > Yarn.
> > > > > And
> > > > > > >> we
> > > > > > >> >>> will
> > > > > > >> >>>>> have
> > > > > > >> >>>>> the same deployment requirements to the client and entry
> > > > point.
> > > > > > >> >>>>>
> > > > > > >> >>>>> 1. Flink client not always need a local jar to start a
> > Flink
> > > > > > per-job
> > > > > > >> >>>>> cluster. We could
> > > > > > >> >>>>> support multiple schemas. For example,
> > > file:///path/of/my.jar
> > > > > > means
> > > > > > >> a
> > > > > > >> >>> jar
> > > > > > >> >>>>> located
> > > > > > >> >>>>> at client side, hdfs://myhdfs/user/myname/flink/my.jar
> > > means a
> > > > > jar
> > > > > > >> >>> located
> > > > > > >> >>>>> at
> > > > > > >> >>>>> remote hdfs, local:///path/in/image/my.jar means a jar
> > > located
> > > > > at
> > > > > > >> >>>>> jobmanager side.
> > > > > > >> >>>>>
> > > > > > >> >>>>> 2. Support running user program on master side. This also
> > > > means
> > > > > > the
> > > > > > >> >>> entry
> > > > > > >> >>>>> point
> > > > > > >> >>>>> will generate the job graph on master side. We could use
> > the
> > > > > > >> >>>>> ClasspathJobGraphRetriever
> > > > > > >> >>>>> or start a local Flink client to achieve this purpose.
> > > > > > >> >>>>>
> > > > > > >> >>>>>
> > > > > > >> >>>>> cc tison, Aljoscha & Kostas Do you think this is the right
> > > > > > >> direction we
> > > > > > >> >>>>> need to work?
> > > > > > >> >>>>>
> > > > > > >> >>>>> tison <[hidden email]> 于2019年12月12日周四 下午4:48写道:
> > > > > > >> >>>>>
> > > > > > >> >>>>>> A quick idea is that we separate the deployment from user
> > > > > program
> > > > > > >> >>> that
> > > > > > >> >>>>> it
> > > > > > >> >>>>>> has always been done
> > > > > > >> >>>>>> outside the program. On user program executed there is
> > > > always a
> > > > > > >> >>>>>> ClusterClient that communicates with
> > > > > > >> >>>>>> an existing cluster, remote or local. It will be another
> > > > thread
> > > > > > so
> > > > > > >> >>> just
> > > > > > >> >>>>> for
> > > > > > >> >>>>>> your information.
> > > > > > >> >>>>>>
> > > > > > >> >>>>>> Best,
> > > > > > >> >>>>>> tison.
> > > > > > >> >>>>>>
> > > > > > >> >>>>>>
> > > > > > >> >>>>>> tison <[hidden email]> 于2019年12月12日周四 下午4:40写道:
> > > > > > >> >>>>>>
> > > > > > >> >>>>>>> Hi Peter,
> > > > > > >> >>>>>>>
> > > > > > >> >>>>>>> Another concern I realized recently is that with current
> > > > > > Executors
> > > > > > >> >>>>>>> abstraction(FLIP-73)
> > > > > > >> >>>>>>> I'm afraid that user program is designed to ALWAYS run
> > on
> > > > the
> > > > > > >> >>> client
> > > > > > >> >>>>>> side.
> > > > > > >> >>>>>>> Specifically,
> > > > > > >> >>>>>>> we deploy the job in executor when env.execute called.
> > > This
> > > > > > >> >>>>> abstraction
> > > > > > >> >>>>>>> possibly prevents
> > > > > > >> >>>>>>> Flink runs user program on the cluster side.
> > > > > > >> >>>>>>>
> > > > > > >> >>>>>>> For your proposal, in this case we already compiled the
> > > > > program
> > > > > > >> and
> > > > > > >> >>>>> run
> > > > > > >> >>>>>> on
> > > > > > >> >>>>>>> the client side,
> > > > > > >> >>>>>>> even we deploy a cluster and retrieve job graph from
> > > program
> > > > > > >> >>>>> metadata, it
> > > > > > >> >>>>>>> doesn't make
> > > > > > >> >>>>>>> many sense.
> > > > > > >> >>>>>>>
> > > > > > >> >>>>>>> cc Aljoscha & Kostas what do you think about this
> > > > constraint?
> > > > > > >> >>>>>>>
> > > > > > >> >>>>>>> Best,
> > > > > > >> >>>>>>> tison.
> > > > > > >> >>>>>>>
> > > > > > >> >>>>>>>
> > > > > > >> >>>>>>> Peter Huang <[hidden email]> 于2019年12月10日周二
> > > > > > >> 下午12:45写道:
> > > > > > >> >>>>>>>
> > > > > > >> >>>>>>>> Hi Tison,
> > > > > > >> >>>>>>>>
> > > > > > >> >>>>>>>> Yes, you are right. I think I made the wrong argument
> > in
> > > > the
> > > > > > doc.
> > > > > > >> >>>>>>>> Basically, the packaging jar problem is only for
> > platform
> > > > > > users.
> > > > > > >> >>> In
> > > > > > >> >>>>> our
> > > > > > >> >>>>>>>> internal deploy service,
> > > > > > >> >>>>>>>> we further optimized the deployment latency by letting
> > > > users
> > > > > to
> > > > > > >> >>>>>> packaging
> > > > > > >> >>>>>>>> flink-runtime together with the uber jar, so that we
> > > don't
> > > > > need
> > > > > > >> to
> > > > > > >> >>>>>>>> consider
> > > > > > >> >>>>>>>> multiple flink version
> > > > > > >> >>>>>>>> support for now. In the session client mode, as Flink
> > > libs
> > > > > will
> > > > > > >> be
> > > > > > >> >>>>>> shipped
> > > > > > >> >>>>>>>> anyway as local resources of yarn. Users actually don't
> > > > need
> > > > > to
> > > > > > >> >>>>> package
> > > > > > >> >>>>>>>> those libs into job jar.
> > > > > > >> >>>>>>>>
> > > > > > >> >>>>>>>>
> > > > > > >> >>>>>>>>
> > > > > > >> >>>>>>>> Best Regards
> > > > > > >> >>>>>>>> Peter Huang
> > > > > > >> >>>>>>>>
> > > > > > >> >>>>>>>> On Mon, Dec 9, 2019 at 8:35 PM tison <
> > > [hidden email]
> > > > >
> > > > > > >> >>> wrote:
> > > > > > >> >>>>>>>>
> > > > > > >> >>>>>>>>>> 3. What do you mean about the package? Do users need
> > to
> > > > > > >> >>> compile
> > > > > > >> >>>>>> their
> > > > > > >> >>>>>>>>> jars
> > > > > > >> >>>>>>>>> inlcuding flink-clients, flink-optimizer, flink-table
> > > > codes?
> > > > > > >> >>>>>>>>>
> > > > > > >> >>>>>>>>> The answer should be no because they exist in system
> > > > > > classpath.
> > > > > > >> >>>>>>>>>
> > > > > > >> >>>>>>>>> Best,
> > > > > > >> >>>>>>>>> tison.
> > > > > > >> >>>>>>>>>
> > > > > > >> >>>>>>>>>
> > > > > > >> >>>>>>>>> Yang Wang <[hidden email]> 于2019年12月10日周二
> > > > 下午12:18写道:
> > > > > > >> >>>>>>>>>
> > > > > > >> >>>>>>>>>> Hi Peter,
> > > > > > >> >>>>>>>>>>
> > > > > > >> >>>>>>>>>> Thanks a lot for starting this discussion. I think
> > this
> > > > is
> > > > > a
> > > > > > >> >>> very
> > > > > > >> >>>>>>>> useful
> > > > > > >> >>>>>>>>>> feature.
> > > > > > >> >>>>>>>>>>
> > > > > > >> >>>>>>>>>> Not only for Yarn, i am focused on flink on
> > Kubernetes
> > > > > > >> >>>>> integration
> > > > > > >> >>>>>> and
> > > > > > >> >>>>>>>>> come
> > > > > > >> >>>>>>>>>> across the same
> > > > > > >> >>>>>>>>>> problem. I do not want the job graph generated on
> > > client
> > > > > > side.
> > > > > > >> >>>>>>>> Instead,
> > > > > > >> >>>>>>>>> the
> > > > > > >> >>>>>>>>>> user jars are built in
> > > > > > >> >>>>>>>>>> a user-defined image. When the job manager launched,
> > we
> > > > > just
> > > > > > >> >>>>> need to
> > > > > > >> >>>>>>>>>> generate the job graph
> > > > > > >> >>>>>>>>>> based on local user jars.
> > > > > > >> >>>>>>>>>>
> > > > > > >> >>>>>>>>>> I have some small suggestion about this.
> > > > > > >> >>>>>>>>>>
> > > > > > >> >>>>>>>>>> 1. `ProgramJobGraphRetriever` is very similar to
> > > > > > >> >>>>>>>>>> `ClasspathJobGraphRetriever`, the differences
> > > > > > >> >>>>>>>>>> are the former needs `ProgramMetadata` and the latter
> > > > needs
> > > > > > >> >>> some
> > > > > > >> >>>>>>>>> arguments.
> > > > > > >> >>>>>>>>>> Is it possible to
> > > > > > >> >>>>>>>>>> have an unified `JobGraphRetriever` to support both?
> > > > > > >> >>>>>>>>>> 2. Is it possible to not use a local user jar to
> > start
> > > a
> > > > > > >> >>> per-job
> > > > > > >> >>>>>>>> cluster?
> > > > > > >> >>>>>>>>>> In your case, the user jars has
> > > > > > >> >>>>>>>>>> existed on hdfs already and we do need to download
> > the
> > > > jars
> > > > > > to
> > > > > > >> >>>>>>>> deployer
> > > > > > >> >>>>>>>>>> service. Currently, we
> > > > > > >> >>>>>>>>>> always need a local user jar to start a flink
> > cluster.
> > > It
> > > > > is
> > > > > > >> >>> be
> > > > > > >> >>>>>> great
> > > > > > >> >>>>>>>> if
> > > > > > >> >>>>>>>>> we
> > > > > > >> >>>>>>>>>> could support remote user jars.
> > > > > > >> >>>>>>>>>>>> In the implementation, we assume users package
> > > > > > >> >>> flink-clients,
> > > > > > >> >>>>>>>>>> flink-optimizer, flink-table together within the job
> > > jar.
> > > > > > >> >>>>> Otherwise,
> > > > > > >> >>>>>>>> the
> > > > > > >> >>>>>>>>>> job graph generation within JobClusterEntryPoint will
> > > > fail.
> > > > > > >> >>>>>>>>>> 3. What do you mean about the package? Do users need
> > to
> > > > > > >> >>> compile
> > > > > > >> >>>>>> their
> > > > > > >> >>>>>>>>> jars
> > > > > > >> >>>>>>>>>> inlcuding flink-clients, flink-optimizer, flink-table
> > > > > codes?
> > > > > > >> >>>>>>>>>>
> > > > > > >> >>>>>>>>>>
> > > > > > >> >>>>>>>>>>
> > > > > > >> >>>>>>>>>> Best,
> > > > > > >> >>>>>>>>>> Yang
> > > > > > >> >>>>>>>>>>
> > > > > > >> >>>>>>>>>> Peter Huang <[hidden email]>
> > > 于2019年12月10日周二
> > > > > > >> >>>>> 上午2:37写道:
> > > > > > >> >>>>>>>>>>
> > > > > > >> >>>>>>>>>>> Dear All,
> > > > > > >> >>>>>>>>>>>
> > > > > > >> >>>>>>>>>>> Recently, the Flink community starts to improve the
> > > yarn
> > > > > > >> >>>>> cluster
> > > > > > >> >>>>>>>>>> descriptor
> > > > > > >> >>>>>>>>>>> to make job jar and config files configurable from
> > > CLI.
> > > > It
> > > > > > >> >>>>>> improves
> > > > > > >> >>>>>>>> the
> > > > > > >> >>>>>>>>>>> flexibility of  Flink deployment Yarn Per Job Mode.
> > > For
> > > > > > >> >>>>> platform
> > > > > > >> >>>>>>>> users
> > > > > > >> >>>>>>>>>> who
> > > > > > >> >>>>>>>>>>> manage tens of hundreds of streaming pipelines for
> > the
> > > > > whole
> > > > > > >> >>>>> org
> > > > > > >> >>>>>> or
> > > > > > >> >>>>>>>>>>> company, we found the job graph generation in
> > > > client-side
> > > > > is
> > > > > > >> >>>>>> another
> > > > > > >> >>>>>>>>>>> pinpoint. Thus, we want to propose a configurable
> > > > feature
> > > > > > >> >>> for
> > > > > > >> >>>>>>>>>>> FlinkYarnSessionCli. The feature can allow users to
> > > > choose
> > > > > > >> >>> the
> > > > > > >> >>>>> job
> > > > > > >> >>>>>>>>> graph
> > > > > > >> >>>>>>>>>>> generation in Flink ClusterEntryPoint so that the
> > job
> > > > jar
> > > > > > >> >>>>> doesn't
> > > > > > >> >>>>>>>> need
> > > > > > >> >>>>>>>>> to
> > > > > > >> >>>>>>>>>>> be locally for the job graph generation. The
> > proposal
> > > is
> > > > > > >> >>>>> organized
> > > > > > >> >>>>>>>> as a
> > > > > > >> >>>>>>>>>>> FLIP
> > > > > > >> >>>>>>>>>>>
> > > > > > >> >>>>>>>>>>>
> > > > > > >> >>>>>>>>>>
> > > > > > >> >>>>>>>>>
> > > > > > >> >>>>>>>>
> > > > > > >> >>>>>>
> > > > > > >> >>>>>
> > > > > > >> >>>
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-85+Delayed+JobGraph+Generation
> > > > > > >> >>>>>>>>>>> .
> > > > > > >> >>>>>>>>>>>
> > > > > > >> >>>>>>>>>>> Any questions and suggestions are welcomed. Thank
> > you
> > > in
> > > > > > >> >>>>> advance.
> > > > > > >> >>>>>>>>>>>
> > > > > > >> >>>>>>>>>>>
> > > > > > >> >>>>>>>>>>> Best Regards
> > > > > > >> >>>>>>>>>>> Peter Huang
> > > > > > >> >>>>>>>>>>>
> > > > > > >> >>>>>>>>>>
> > > > > > >> >>>>>>>>>
> > > > > > >> >>>>>>>>
> > > > > > >> >>>>>>>
> > > > > > >> >>>>>>
> > > > > > >> >>>>>
> > > > > > >> >>>>
> > > > > > >> >>>
> > > > > > >> >>
> > > > > > >>
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-85: Delayed Job Graph Generation

Peter Huang
Hi Kostas,

Thanks for this feedback. I can't agree more about the opinion. The cluster
mode should be added
first in per job cluster.

1) For job cluster implementation
1. Job graph recovery from configuration or store as static job graph as
session cluster. I think the static one will be better for less recovery
time.
Let me update the doc for details.

2. For job execute multiple times, I think @Zili Chen
<[hidden email]> has
proposed the local client solution that can
the run program actually in the cluster entry point. We can put the
implementation in the second stage,
or even a new FLIP for further discussion.

2) For session cluster implementation
We can disable the cluster mode for the session cluster in the first stage.
I agree the jar downloading will be a painful thing.
We can consider about PoC and performance evaluation first. If the end to
end experience is good enough, then we can consider
proceeding with the solution.

Looking forward to more opinions from @Yang Wang <[hidden email]> @Zili
Chen <[hidden email]> @Dian Fu <[hidden email]>.


Best Regards
Peter Huang

On Wed, Jan 15, 2020 at 7:50 AM Kostas Kloudas <[hidden email]> wrote:

> Hi all,
>
> I am writing here as the discussion on the Google Doc seems to be a
> bit difficult to follow.
>
> I think that in order to be able to make progress, it would be helpful
> to focus on per-job mode for now.
> The reason is that:
>  1) making the (unique) JobSubmitHandler responsible for creating the
> jobgraphs,
>   which includes downloading dependencies, is not an optimal solution
>  2) even if we put the responsibility on the JobMaster, currently each
> job has its own
>   JobMaster but they all run on the same process, so we have again a
> single entity.
>
> Of course after this is done, and if we feel comfortable with the
> solution, then we can go to the session mode.
>
> A second comment has to do with fault-tolerance in the per-job,
> cluster-deploy mode.
> In the document, it is suggested that upon recovery, the JobMaster of
> each job re-creates the JobGraph.
> I am just wondering if it is better to create and store the jobGraph
> upon submission and only fetch it
> upon recovery so that we have a static jobGraph.
>
> Finally, I have a question which is what happens with jobs that have
> multiple execute calls?
> The semantics seem to change compared to the current behaviour, right?
>
> Cheers,
> Kostas
>
> On Wed, Jan 8, 2020 at 8:05 PM tison <[hidden email]> wrote:
> >
> > not always, Yang Wang is also not yet a committer but he can join the
> > channel. I cannot find the id by clicking “Add new member in channel” so
> > come to you and ask for try out the link. Possibly I will find other ways
> > but the original purpose is that the slack channel is a public area we
> > discuss about developing...
> > Best,
> > tison.
> >
> >
> > Peter Huang <[hidden email]> 于2020年1月9日周四 上午2:44写道:
> >
> > > Hi Tison,
> > >
> > > I am not the committer of Flink yet. I think I can't join it also.
> > >
> > >
> > > Best Regards
> > > Peter Huang
> > >
> > > On Wed, Jan 8, 2020 at 9:39 AM tison <[hidden email]> wrote:
> > >
> > > > Hi Peter,
> > > >
> > > > Could you try out this link?
> > > https://the-asf.slack.com/messages/CNA3ADZPH
> > > >
> > > > Best,
> > > > tison.
> > > >
> > > >
> > > > Peter Huang <[hidden email]> 于2020年1月9日周四 上午1:22写道:
> > > >
> > > > > Hi Tison,
> > > > >
> > > > > I can't join the group with shared link. Would you please add me
> into
> > > the
> > > > > group? My slack account is huangzhenqiu0825.
> > > > > Thank you in advance.
> > > > >
> > > > >
> > > > > Best Regards
> > > > > Peter Huang
> > > > >
> > > > > On Wed, Jan 8, 2020 at 12:02 AM tison <[hidden email]>
> wrote:
> > > > >
> > > > > > Hi Peter,
> > > > > >
> > > > > > As described above, this effort should get attention from people
> > > > > developing
> > > > > > FLIP-73 a.k.a. Executor abstractions. I recommend you to join the
> > > > public
> > > > > > slack channel[1] for Flink Client API Enhancement and you can
> try to
> > > > > share
> > > > > > you detailed thoughts there. It possibly gets more concrete
> > > attentions.
> > > > > >
> > > > > > Best,
> > > > > > tison.
> > > > > >
> > > > > > [1]
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> https://slack.com/share/IS21SJ75H/Rk8HhUly9FuEHb7oGwBZ33uL/enQtODg2MDYwNjE5MTg3LTA2MjIzNDc1M2ZjZDVlMjdlZjk1M2RkYmJhNjAwMTk2ZDZkODQ4NmY5YmI4OGRhNWJkYTViMTM1NzlmMzc4OWM
> > > > > >
> > > > > >
> > > > > > Peter Huang <[hidden email]> 于2020年1月7日周二 上午5:09写道:
> > > > > >
> > > > > > > Dear All,
> > > > > > >
> > > > > > > Happy new year! According to existing feedback from the
> community,
> > > we
> > > > > > > revised the doc with the consideration of session cluster
> support,
> > > > and
> > > > > > > concrete interface changes needed and execution plan. Please
> take
> > > one
> > > > > > more
> > > > > > > round of review at your most convenient time.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> https://docs.google.com/document/d/1aAwVjdZByA-0CHbgv16Me-vjaaDMCfhX7TzVVTuifYM/edit#
> > > > > > >
> > > > > > >
> > > > > > > Best Regards
> > > > > > > Peter Huang
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Jan 2, 2020 at 11:29 AM Peter Huang <
> > > > > [hidden email]>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Dian,
> > > > > > > > Thanks for giving us valuable feedbacks.
> > > > > > > >
> > > > > > > > 1) It's better to have a whole design for this feature
> > > > > > > > For the suggestion of enabling the cluster mode also session
> > > > > cluster, I
> > > > > > > > think Flink already supported it. WebSubmissionExtension
> already
> > > > > allows
> > > > > > > > users to start a job with the specified jar by using web UI.
> > > > > > > > But we need to enable the feature from CLI for both local
> jar,
> > > > remote
> > > > > > > jar.
> > > > > > > > I will align with Yang Wang first about the details and
> update
> > > the
> > > > > > design
> > > > > > > > doc.
> > > > > > > >
> > > > > > > > 2) It's better to consider the convenience for users, such as
> > > > > debugging
> > > > > > > >
> > > > > > > > I am wondering whether we can store the exception in jobgragh
> > > > > > > > generation in application master. As no streaming graph can
> be
> > > > > > scheduled
> > > > > > > in
> > > > > > > > this case, there will be no more TM will be requested from
> > > FlinkRM.
> > > > > > > > If the AM is still running, users can still query it from
> CLI. As
> > > > it
> > > > > > > > requires more change, we can get some feedback from <
> > > > > > [hidden email]
> > > > > > > >
> > > > > > > > and @[hidden email] <[hidden email]>.
> > > > > > > >
> > > > > > > > 3) It's better to consider the impact to the stability of the
> > > > cluster
> > > > > > > >
> > > > > > > > I agree with Yang Wang's opinion.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > Best Regards
> > > > > > > > Peter Huang
> > > > > > > >
> > > > > > > >
> > > > > > > > On Sun, Dec 29, 2019 at 9:44 PM Dian Fu <
> [hidden email]>
> > > > > wrote:
> > > > > > > >
> > > > > > > >> Hi all,
> > > > > > > >>
> > > > > > > >> Sorry to jump into this discussion. Thanks everyone for the
> > > > > > discussion.
> > > > > > > >> I'm very interested in this topic although I'm not an
> expert in
> > > > this
> > > > > > > part.
> > > > > > > >> So I'm glad to share my thoughts as following:
> > > > > > > >>
> > > > > > > >> 1) It's better to have a whole design for this feature
> > > > > > > >> As we know, there are two deployment modes: per-job mode and
> > > > session
> > > > > > > >> mode. I'm wondering which mode really needs this feature.
> As the
> > > > > > design
> > > > > > > doc
> > > > > > > >> mentioned, per-job mode is more used for streaming jobs and
> > > > session
> > > > > > > mode is
> > > > > > > >> usually used for batch jobs(Of course, the job types and the
> > > > > > deployment
> > > > > > > >> modes are orthogonal). Usually streaming job is only needed
> to
> > > be
> > > > > > > submitted
> > > > > > > >> once and it will run for days or weeks, while batch jobs
> will be
> > > > > > > submitted
> > > > > > > >> more frequently compared with streaming jobs. This means
> that
> > > > maybe
> > > > > > > session
> > > > > > > >> mode also needs this feature. However, if we support this
> > > feature
> > > > in
> > > > > > > >> session mode, the application master will become the new
> > > > centralized
> > > > > > > >> service(which should be solved). So in this case, it's
> better to
> > > > > have
> > > > > > a
> > > > > > > >> complete design for both per-job mode and session mode.
> > > > Furthermore,
> > > > > > > even
> > > > > > > >> if we can do it phase by phase, we need to have a whole
> picture
> > > of
> > > > > how
> > > > > > > it
> > > > > > > >> works in both per-job mode and session mode.
> > > > > > > >>
> > > > > > > >> 2) It's better to consider the convenience for users, such
> as
> > > > > > debugging
> > > > > > > >> After we finish this feature, the job graph will be
> compiled in
> > > > the
> > > > > > > >> application master, which means that users cannot easily
> get the
> > > > > > > exception
> > > > > > > >> message synchorousely in the job client if there are
> problems
> > > > during
> > > > > > the
> > > > > > > >> job graph compiling (especially for platform users), such
> as the
> > > > > > > resource
> > > > > > > >> path is incorrect, the user program itself has some
> problems,
> > > etc.
> > > > > > What
> > > > > > > I'm
> > > > > > > >> thinking is that maybe we should throw the exceptions as
> early
> > > as
> > > > > > > possible
> > > > > > > >> (during job submission stage).
> > > > > > > >>
> > > > > > > >> 3) It's better to consider the impact to the stability of
> the
> > > > > cluster
> > > > > > > >> If we perform the compiling in the application master, we
> should
> > > > > > > consider
> > > > > > > >> the impact of the compiling errors. Although YARN could
> resume
> > > the
> > > > > > > >> application master in case of failures, but in some case the
> > > > > compiling
> > > > > > > >> failure may be a waste of cluster resource and may impact
> the
> > > > > > stability
> > > > > > > the
> > > > > > > >> cluster and the other jobs in the cluster, such as the
> resource
> > > > path
> > > > > > is
> > > > > > > >> incorrect, the user program itself has some problems(in this
> > > case,
> > > > > job
> > > > > > > >> failover cannot solve this kind of problems) etc. In the
> current
> > > > > > > >> implemention, the compiling errors are handled in the client
> > > side
> > > > > and
> > > > > > > there
> > > > > > > >> is no impact to the cluster at all.
> > > > > > > >>
> > > > > > > >> Regarding to 1), it's clearly pointed in the design doc that
> > > only
> > > > > > > per-job
> > > > > > > >> mode will be supported. However, I think it's better to also
> > > > > consider
> > > > > > > the
> > > > > > > >> session mode in the design doc.
> > > > > > > >> Regarding to 2) and 3), I have not seen related sections in
> the
> > > > > design
> > > > > > > >> doc. It will be good if we can cover them in the design doc.
> > > > > > > >>
> > > > > > > >> Feel free to correct me If there is anything I
> misunderstand.
> > > > > > > >>
> > > > > > > >> Regards,
> > > > > > > >> Dian
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> > 在 2019年12月27日,上午3:13,Peter Huang <
> [hidden email]>
> > > > 写道:
> > > > > > > >> >
> > > > > > > >> > Hi Yang,
> > > > > > > >> >
> > > > > > > >> > I can't agree more. The effort definitely needs to align
> with
> > > > the
> > > > > > > final
> > > > > > > >> > goal of FLIP-73.
> > > > > > > >> > I am thinking about whether we can achieve the goal with
> two
> > > > > phases.
> > > > > > > >> >
> > > > > > > >> > 1) Phase I
> > > > > > > >> > As the CLiFrontend will not be depreciated soon. We can
> still
> > > > use
> > > > > > the
> > > > > > > >> > deployMode flag there,
> > > > > > > >> > pass the program info through Flink configuration,  use
> the
> > > > > > > >> > ClassPathJobGraphRetriever
> > > > > > > >> > to generate the job graph in ClusterEntrypoints of yarn
> and
> > > > > > > Kubernetes.
> > > > > > > >> >
> > > > > > > >> > 2) Phase II
> > > > > > > >> > In  AbstractJobClusterExecutor, the job graph is
> generated in
> > > > the
> > > > > > > >> execute
> > > > > > > >> > function. We can still
> > > > > > > >> > use the deployMode in it. With deployMode = cluster, the
> > > execute
> > > > > > > >> function
> > > > > > > >> > only starts the cluster.
> > > > > > > >> >
> > > > > > > >> > When {Yarn/Kuberneates}PerJobClusterEntrypoint starts, It
> will
> > > > > start
> > > > > > > the
> > > > > > > >> > dispatch first, then we can use
> > > > > > > >> > a ClusterEnvironment similar to ContextEnvironment to
> submit
> > > the
> > > > > job
> > > > > > > >> with
> > > > > > > >> > jobName the local
> > > > > > > >> > dispatcher. For the details, we need more investigation.
> Let's
> > > > > wait
> > > > > > > >> > for @Aljoscha
> > > > > > > >> > Krettek <[hidden email]> @Till Rohrmann <
> > > > > [hidden email]
> > > > > > >'s
> > > > > > > >> > feedback after the holiday season.
> > > > > > > >> >
> > > > > > > >> > Thank you in advance. Merry Chrismas and Happy New Year!!!
> > > > > > > >> >
> > > > > > > >> >
> > > > > > > >> > Best Regards
> > > > > > > >> > Peter Huang
> > > > > > > >> >
> > > > > > > >> >
> > > > > > > >> >
> > > > > > > >> >
> > > > > > > >> >
> > > > > > > >> >
> > > > > > > >> >
> > > > > > > >> >
> > > > > > > >> > On Wed, Dec 25, 2019 at 1:08 AM Yang Wang <
> > > > [hidden email]>
> > > > > > > >> wrote:
> > > > > > > >> >
> > > > > > > >> >> Hi Peter,
> > > > > > > >> >>
> > > > > > > >> >> I think we need to reconsider tison's suggestion
> seriously.
> > > > After
> > > > > > > >> FLIP-73,
> > > > > > > >> >> the deployJobCluster has
> > > > > > > >> >> beenmoved into `JobClusterExecutor#execute`. It should
> not be
> > > > > > > perceived
> > > > > > > >> >> for `CliFrontend`. That
> > > > > > > >> >> means the user program will *ALWAYS* be executed on
> client
> > > > side.
> > > > > > This
> > > > > > > >> is
> > > > > > > >> >> the by design behavior.
> > > > > > > >> >> So, we could not just add `if(client mode) .. else
> if(cluster
> > > > > mode)
> > > > > > > >> ...`
> > > > > > > >> >> codes in `CliFrontend` to bypass
> > > > > > > >> >> the executor. We need to find a clean way to decouple
> > > executing
> > > > > > user
> > > > > > > >> >> program and deploying per-job
> > > > > > > >> >> cluster. Based on this, we could support to execute user
> > > > program
> > > > > on
> > > > > > > >> client
> > > > > > > >> >> or master side.
> > > > > > > >> >>
> > > > > > > >> >> Maybe Aljoscha and Jeff could give some good suggestions.
> > > > > > > >> >>
> > > > > > > >> >>
> > > > > > > >> >>
> > > > > > > >> >> Best,
> > > > > > > >> >> Yang
> > > > > > > >> >>
> > > > > > > >> >> Peter Huang <[hidden email]> 于2019年12月25日周三
> > > > > 上午4:03写道:
> > > > > > > >> >>
> > > > > > > >> >>> Hi Jingjing,
> > > > > > > >> >>>
> > > > > > > >> >>> The improvement proposed is a deployment option for
> CLI. For
> > > > SQL
> > > > > > > based
> > > > > > > >> >>> Flink application, It is more convenient to use the
> existing
> > > > > model
> > > > > > > in
> > > > > > > >> >>> SqlClient in which
> > > > > > > >> >>> the job graph is generated within SqlClient. After
> adding
> > > the
> > > > > > > delayed
> > > > > > > >> job
> > > > > > > >> >>> graph generation, I think there is no change is needed
> for
> > > > your
> > > > > > > side.
> > > > > > > >> >>>
> > > > > > > >> >>>
> > > > > > > >> >>> Best Regards
> > > > > > > >> >>> Peter Huang
> > > > > > > >> >>>
> > > > > > > >> >>>
> > > > > > > >> >>> On Wed, Dec 18, 2019 at 6:01 AM jingjing bai <
> > > > > > > >> [hidden email]>
> > > > > > > >> >>> wrote:
> > > > > > > >> >>>
> > > > > > > >> >>>> hi peter:
> > > > > > > >> >>>>    we had extension SqlClent to support sql job submit
> in
> > > web
> > > > > > base
> > > > > > > on
> > > > > > > >> >>>> flink 1.9.   we support submit to yarn on per job mode
> too.
> > > > > > > >> >>>>    in this case, the job graph generated  on client
> side
> > > .  I
> > > > > > think
> > > > > > > >> >>> this
> > > > > > > >> >>>> discuss Mainly to improve api programme.  but in my
> case ,
> > > > > there
> > > > > > is
> > > > > > > >> no
> > > > > > > >> >>>> jar to upload but only a sql string .
> > > > > > > >> >>>>    do u had more suggestion to improve for sql mode or
> it
> > > is
> > > > > > only a
> > > > > > > >> >>>> switch for api programme?
> > > > > > > >> >>>>
> > > > > > > >> >>>>
> > > > > > > >> >>>> best
> > > > > > > >> >>>> bai jj
> > > > > > > >> >>>>
> > > > > > > >> >>>>
> > > > > > > >> >>>> Yang Wang <[hidden email]> 于2019年12月18日周三
> 下午7:21写道:
> > > > > > > >> >>>>
> > > > > > > >> >>>>> I just want to revive this discussion.
> > > > > > > >> >>>>>
> > > > > > > >> >>>>> Recently, i am thinking about how to natively run
> flink
> > > > > per-job
> > > > > > > >> >>> cluster on
> > > > > > > >> >>>>> Kubernetes.
> > > > > > > >> >>>>> The per-job mode on Kubernetes is very different from
> on
> > > > Yarn.
> > > > > > And
> > > > > > > >> we
> > > > > > > >> >>> will
> > > > > > > >> >>>>> have
> > > > > > > >> >>>>> the same deployment requirements to the client and
> entry
> > > > > point.
> > > > > > > >> >>>>>
> > > > > > > >> >>>>> 1. Flink client not always need a local jar to start a
> > > Flink
> > > > > > > per-job
> > > > > > > >> >>>>> cluster. We could
> > > > > > > >> >>>>> support multiple schemas. For example,
> > > > file:///path/of/my.jar
> > > > > > > means
> > > > > > > >> a
> > > > > > > >> >>> jar
> > > > > > > >> >>>>> located
> > > > > > > >> >>>>> at client side, hdfs://myhdfs/user/myname/flink/my.jar
> > > > means a
> > > > > > jar
> > > > > > > >> >>> located
> > > > > > > >> >>>>> at
> > > > > > > >> >>>>> remote hdfs, local:///path/in/image/my.jar means a jar
> > > > located
> > > > > > at
> > > > > > > >> >>>>> jobmanager side.
> > > > > > > >> >>>>>
> > > > > > > >> >>>>> 2. Support running user program on master side. This
> also
> > > > > means
> > > > > > > the
> > > > > > > >> >>> entry
> > > > > > > >> >>>>> point
> > > > > > > >> >>>>> will generate the job graph on master side. We could
> use
> > > the
> > > > > > > >> >>>>> ClasspathJobGraphRetriever
> > > > > > > >> >>>>> or start a local Flink client to achieve this purpose.
> > > > > > > >> >>>>>
> > > > > > > >> >>>>>
> > > > > > > >> >>>>> cc tison, Aljoscha & Kostas Do you think this is the
> right
> > > > > > > >> direction we
> > > > > > > >> >>>>> need to work?
> > > > > > > >> >>>>>
> > > > > > > >> >>>>> tison <[hidden email]> 于2019年12月12日周四 下午4:48写道:
> > > > > > > >> >>>>>
> > > > > > > >> >>>>>> A quick idea is that we separate the deployment from
> user
> > > > > > program
> > > > > > > >> >>> that
> > > > > > > >> >>>>> it
> > > > > > > >> >>>>>> has always been done
> > > > > > > >> >>>>>> outside the program. On user program executed there
> is
> > > > > always a
> > > > > > > >> >>>>>> ClusterClient that communicates with
> > > > > > > >> >>>>>> an existing cluster, remote or local. It will be
> another
> > > > > thread
> > > > > > > so
> > > > > > > >> >>> just
> > > > > > > >> >>>>> for
> > > > > > > >> >>>>>> your information.
> > > > > > > >> >>>>>>
> > > > > > > >> >>>>>> Best,
> > > > > > > >> >>>>>> tison.
> > > > > > > >> >>>>>>
> > > > > > > >> >>>>>>
> > > > > > > >> >>>>>> tison <[hidden email]> 于2019年12月12日周四
> 下午4:40写道:
> > > > > > > >> >>>>>>
> > > > > > > >> >>>>>>> Hi Peter,
> > > > > > > >> >>>>>>>
> > > > > > > >> >>>>>>> Another concern I realized recently is that with
> current
> > > > > > > Executors
> > > > > > > >> >>>>>>> abstraction(FLIP-73)
> > > > > > > >> >>>>>>> I'm afraid that user program is designed to ALWAYS
> run
> > > on
> > > > > the
> > > > > > > >> >>> client
> > > > > > > >> >>>>>> side.
> > > > > > > >> >>>>>>> Specifically,
> > > > > > > >> >>>>>>> we deploy the job in executor when env.execute
> called.
> > > > This
> > > > > > > >> >>>>> abstraction
> > > > > > > >> >>>>>>> possibly prevents
> > > > > > > >> >>>>>>> Flink runs user program on the cluster side.
> > > > > > > >> >>>>>>>
> > > > > > > >> >>>>>>> For your proposal, in this case we already compiled
> the
> > > > > > program
> > > > > > > >> and
> > > > > > > >> >>>>> run
> > > > > > > >> >>>>>> on
> > > > > > > >> >>>>>>> the client side,
> > > > > > > >> >>>>>>> even we deploy a cluster and retrieve job graph from
> > > > program
> > > > > > > >> >>>>> metadata, it
> > > > > > > >> >>>>>>> doesn't make
> > > > > > > >> >>>>>>> many sense.
> > > > > > > >> >>>>>>>
> > > > > > > >> >>>>>>> cc Aljoscha & Kostas what do you think about this
> > > > > constraint?
> > > > > > > >> >>>>>>>
> > > > > > > >> >>>>>>> Best,
> > > > > > > >> >>>>>>> tison.
> > > > > > > >> >>>>>>>
> > > > > > > >> >>>>>>>
> > > > > > > >> >>>>>>> Peter Huang <[hidden email]>
> 于2019年12月10日周二
> > > > > > > >> 下午12:45写道:
> > > > > > > >> >>>>>>>
> > > > > > > >> >>>>>>>> Hi Tison,
> > > > > > > >> >>>>>>>>
> > > > > > > >> >>>>>>>> Yes, you are right. I think I made the wrong
> argument
> > > in
> > > > > the
> > > > > > > doc.
> > > > > > > >> >>>>>>>> Basically, the packaging jar problem is only for
> > > platform
> > > > > > > users.
> > > > > > > >> >>> In
> > > > > > > >> >>>>> our
> > > > > > > >> >>>>>>>> internal deploy service,
> > > > > > > >> >>>>>>>> we further optimized the deployment latency by
> letting
> > > > > users
> > > > > > to
> > > > > > > >> >>>>>> packaging
> > > > > > > >> >>>>>>>> flink-runtime together with the uber jar, so that
> we
> > > > don't
> > > > > > need
> > > > > > > >> to
> > > > > > > >> >>>>>>>> consider
> > > > > > > >> >>>>>>>> multiple flink version
> > > > > > > >> >>>>>>>> support for now. In the session client mode, as
> Flink
> > > > libs
> > > > > > will
> > > > > > > >> be
> > > > > > > >> >>>>>> shipped
> > > > > > > >> >>>>>>>> anyway as local resources of yarn. Users actually
> don't
> > > > > need
> > > > > > to
> > > > > > > >> >>>>> package
> > > > > > > >> >>>>>>>> those libs into job jar.
> > > > > > > >> >>>>>>>>
> > > > > > > >> >>>>>>>>
> > > > > > > >> >>>>>>>>
> > > > > > > >> >>>>>>>> Best Regards
> > > > > > > >> >>>>>>>> Peter Huang
> > > > > > > >> >>>>>>>>
> > > > > > > >> >>>>>>>> On Mon, Dec 9, 2019 at 8:35 PM tison <
> > > > [hidden email]
> > > > > >
> > > > > > > >> >>> wrote:
> > > > > > > >> >>>>>>>>
> > > > > > > >> >>>>>>>>>> 3. What do you mean about the package? Do users
> need
> > > to
> > > > > > > >> >>> compile
> > > > > > > >> >>>>>> their
> > > > > > > >> >>>>>>>>> jars
> > > > > > > >> >>>>>>>>> inlcuding flink-clients, flink-optimizer,
> flink-table
> > > > > codes?
> > > > > > > >> >>>>>>>>>
> > > > > > > >> >>>>>>>>> The answer should be no because they exist in
> system
> > > > > > > classpath.
> > > > > > > >> >>>>>>>>>
> > > > > > > >> >>>>>>>>> Best,
> > > > > > > >> >>>>>>>>> tison.
> > > > > > > >> >>>>>>>>>
> > > > > > > >> >>>>>>>>>
> > > > > > > >> >>>>>>>>> Yang Wang <[hidden email]> 于2019年12月10日周二
> > > > > 下午12:18写道:
> > > > > > > >> >>>>>>>>>
> > > > > > > >> >>>>>>>>>> Hi Peter,
> > > > > > > >> >>>>>>>>>>
> > > > > > > >> >>>>>>>>>> Thanks a lot for starting this discussion. I
> think
> > > this
> > > > > is
> > > > > > a
> > > > > > > >> >>> very
> > > > > > > >> >>>>>>>> useful
> > > > > > > >> >>>>>>>>>> feature.
> > > > > > > >> >>>>>>>>>>
> > > > > > > >> >>>>>>>>>> Not only for Yarn, i am focused on flink on
> > > Kubernetes
> > > > > > > >> >>>>> integration
> > > > > > > >> >>>>>> and
> > > > > > > >> >>>>>>>>> come
> > > > > > > >> >>>>>>>>>> across the same
> > > > > > > >> >>>>>>>>>> problem. I do not want the job graph generated on
> > > > client
> > > > > > > side.
> > > > > > > >> >>>>>>>> Instead,
> > > > > > > >> >>>>>>>>> the
> > > > > > > >> >>>>>>>>>> user jars are built in
> > > > > > > >> >>>>>>>>>> a user-defined image. When the job manager
> launched,
> > > we
> > > > > > just
> > > > > > > >> >>>>> need to
> > > > > > > >> >>>>>>>>>> generate the job graph
> > > > > > > >> >>>>>>>>>> based on local user jars.
> > > > > > > >> >>>>>>>>>>
> > > > > > > >> >>>>>>>>>> I have some small suggestion about this.
> > > > > > > >> >>>>>>>>>>
> > > > > > > >> >>>>>>>>>> 1. `ProgramJobGraphRetriever` is very similar to
> > > > > > > >> >>>>>>>>>> `ClasspathJobGraphRetriever`, the differences
> > > > > > > >> >>>>>>>>>> are the former needs `ProgramMetadata` and the
> latter
> > > > > needs
> > > > > > > >> >>> some
> > > > > > > >> >>>>>>>>> arguments.
> > > > > > > >> >>>>>>>>>> Is it possible to
> > > > > > > >> >>>>>>>>>> have an unified `JobGraphRetriever` to support
> both?
> > > > > > > >> >>>>>>>>>> 2. Is it possible to not use a local user jar to
> > > start
> > > > a
> > > > > > > >> >>> per-job
> > > > > > > >> >>>>>>>> cluster?
> > > > > > > >> >>>>>>>>>> In your case, the user jars has
> > > > > > > >> >>>>>>>>>> existed on hdfs already and we do need to
> download
> > > the
> > > > > jars
> > > > > > > to
> > > > > > > >> >>>>>>>> deployer
> > > > > > > >> >>>>>>>>>> service. Currently, we
> > > > > > > >> >>>>>>>>>> always need a local user jar to start a flink
> > > cluster.
> > > > It
> > > > > > is
> > > > > > > >> >>> be
> > > > > > > >> >>>>>> great
> > > > > > > >> >>>>>>>> if
> > > > > > > >> >>>>>>>>> we
> > > > > > > >> >>>>>>>>>> could support remote user jars.
> > > > > > > >> >>>>>>>>>>>> In the implementation, we assume users package
> > > > > > > >> >>> flink-clients,
> > > > > > > >> >>>>>>>>>> flink-optimizer, flink-table together within the
> job
> > > > jar.
> > > > > > > >> >>>>> Otherwise,
> > > > > > > >> >>>>>>>> the
> > > > > > > >> >>>>>>>>>> job graph generation within JobClusterEntryPoint
> will
> > > > > fail.
> > > > > > > >> >>>>>>>>>> 3. What do you mean about the package? Do users
> need
> > > to
> > > > > > > >> >>> compile
> > > > > > > >> >>>>>> their
> > > > > > > >> >>>>>>>>> jars
> > > > > > > >> >>>>>>>>>> inlcuding flink-clients, flink-optimizer,
> flink-table
> > > > > > codes?
> > > > > > > >> >>>>>>>>>>
> > > > > > > >> >>>>>>>>>>
> > > > > > > >> >>>>>>>>>>
> > > > > > > >> >>>>>>>>>> Best,
> > > > > > > >> >>>>>>>>>> Yang
> > > > > > > >> >>>>>>>>>>
> > > > > > > >> >>>>>>>>>> Peter Huang <[hidden email]>
> > > > 于2019年12月10日周二
> > > > > > > >> >>>>> 上午2:37写道:
> > > > > > > >> >>>>>>>>>>
> > > > > > > >> >>>>>>>>>>> Dear All,
> > > > > > > >> >>>>>>>>>>>
> > > > > > > >> >>>>>>>>>>> Recently, the Flink community starts to improve
> the
> > > > yarn
> > > > > > > >> >>>>> cluster
> > > > > > > >> >>>>>>>>>> descriptor
> > > > > > > >> >>>>>>>>>>> to make job jar and config files configurable
> from
> > > > CLI.
> > > > > It
> > > > > > > >> >>>>>> improves
> > > > > > > >> >>>>>>>> the
> > > > > > > >> >>>>>>>>>>> flexibility of  Flink deployment Yarn Per Job
> Mode.
> > > > For
> > > > > > > >> >>>>> platform
> > > > > > > >> >>>>>>>> users
> > > > > > > >> >>>>>>>>>> who
> > > > > > > >> >>>>>>>>>>> manage tens of hundreds of streaming pipelines
> for
> > > the
> > > > > > whole
> > > > > > > >> >>>>> org
> > > > > > > >> >>>>>> or
> > > > > > > >> >>>>>>>>>>> company, we found the job graph generation in
> > > > > client-side
> > > > > > is
> > > > > > > >> >>>>>> another
> > > > > > > >> >>>>>>>>>>> pinpoint. Thus, we want to propose a
> configurable
> > > > > feature
> > > > > > > >> >>> for
> > > > > > > >> >>>>>>>>>>> FlinkYarnSessionCli. The feature can allow
> users to
> > > > > choose
> > > > > > > >> >>> the
> > > > > > > >> >>>>> job
> > > > > > > >> >>>>>>>>> graph
> > > > > > > >> >>>>>>>>>>> generation in Flink ClusterEntryPoint so that
> the
> > > job
> > > > > jar
> > > > > > > >> >>>>> doesn't
> > > > > > > >> >>>>>>>> need
> > > > > > > >> >>>>>>>>> to
> > > > > > > >> >>>>>>>>>>> be locally for the job graph generation. The
> > > proposal
> > > > is
> > > > > > > >> >>>>> organized
> > > > > > > >> >>>>>>>> as a
> > > > > > > >> >>>>>>>>>>> FLIP
> > > > > > > >> >>>>>>>>>>>
> > > > > > > >> >>>>>>>>>>>
> > > > > > > >> >>>>>>>>>>
> > > > > > > >> >>>>>>>>>
> > > > > > > >> >>>>>>>>
> > > > > > > >> >>>>>>
> > > > > > > >> >>>>>
> > > > > > > >> >>>
> > > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-85+Delayed+JobGraph+Generation
> > > > > > > >> >>>>>>>>>>> .
> > > > > > > >> >>>>>>>>>>>
> > > > > > > >> >>>>>>>>>>> Any questions and suggestions are welcomed.
> Thank
> > > you
> > > > in
> > > > > > > >> >>>>> advance.
> > > > > > > >> >>>>>>>>>>>
> > > > > > > >> >>>>>>>>>>>
> > > > > > > >> >>>>>>>>>>> Best Regards
> > > > > > > >> >>>>>>>>>>> Peter Huang
> > > > > > > >> >>>>>>>>>>>
> > > > > > > >> >>>>>>>>>>
> > > > > > > >> >>>>>>>>>
> > > > > > > >> >>>>>>>>
> > > > > > > >> >>>>>>>
> > > > > > > >> >>>>>>
> > > > > > > >> >>>>>
> > > > > > > >> >>>>
> > > > > > > >> >>>
> > > > > > > >> >>
> > > > > > > >>
> > > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-85: Delayed Job Graph Generation

Yang Wang
Hi all,

Thanks a lot for the feedback from @Kostas Kloudas. Your all concerns are
on point. The FLIP-85 is mainly
focused on supporting cluster mode for per-job. Since it is more urgent and
have much more use
cases both in Yarn and Kubernetes deployment. For session cluster, we could
have more discussion
in a new thread later.

#1, How to download the user jars and dependencies for per-job in cluster
mode?
For Yarn, we could register the user jars and dependencies as
LocalResource. They will be distributed
by Yarn. And once the JobManager and TaskManager launched, the jars are
already exists.
For Standalone per-job and K8s, we expect that the user jars
and dependencies are built into the image.
Or the InitContainer could be used for downloading. It is natively
distributed and we will not have bottleneck.

#2, Job graph recovery
We could have an optimization to store job graph on the DFS. However, i
suggest building a new jobgraph
from the configuration is the default option. Since we will not always have
a DFS store when deploying a
Flink per-job cluster. Of course, we assume that using the same
configuration(e.g. job_id, user_jar, main_class,
main_args, parallelism, savepoint_settings, etc.) will get a same job
graph. I think the standalone per-job
already has the similar behavior.

#3, What happens with jobs that have multiple execute calls?
Currently, it is really a problem. Even we use a local client on Flink
master side, it will have different behavior with
client mode. For client mode, if we execute multiple times, then we will
deploy multiple Flink clusters for each execute.
I am not pretty sure whether it is reasonable. However, i still think using
the local client is a good choice. We could
continue the discussion in a new thread. @Zili Chen <[hidden email]> Do
you want to drive this?



Best,
Yang

Peter Huang <[hidden email]> 于2020年1月16日周四 上午1:55写道:

> Hi Kostas,
>
> Thanks for this feedback. I can't agree more about the opinion. The
> cluster mode should be added
> first in per job cluster.
>
> 1) For job cluster implementation
> 1. Job graph recovery from configuration or store as static job graph as
> session cluster. I think the static one will be better for less recovery
> time.
> Let me update the doc for details.
>
> 2. For job execute multiple times, I think @Zili Chen
> <[hidden email]> has proposed the local client solution that can
> the run program actually in the cluster entry point. We can put the
> implementation in the second stage,
> or even a new FLIP for further discussion.
>
> 2) For session cluster implementation
> We can disable the cluster mode for the session cluster in the first
> stage. I agree the jar downloading will be a painful thing.
> We can consider about PoC and performance evaluation first. If the end to
> end experience is good enough, then we can consider
> proceeding with the solution.
>
> Looking forward to more opinions from @Yang Wang <[hidden email]> @Zili
> Chen <[hidden email]> @Dian Fu <[hidden email]>.
>
>
> Best Regards
> Peter Huang
>
> On Wed, Jan 15, 2020 at 7:50 AM Kostas Kloudas <[hidden email]> wrote:
>
>> Hi all,
>>
>> I am writing here as the discussion on the Google Doc seems to be a
>> bit difficult to follow.
>>
>> I think that in order to be able to make progress, it would be helpful
>> to focus on per-job mode for now.
>> The reason is that:
>>  1) making the (unique) JobSubmitHandler responsible for creating the
>> jobgraphs,
>>   which includes downloading dependencies, is not an optimal solution
>>  2) even if we put the responsibility on the JobMaster, currently each
>> job has its own
>>   JobMaster but they all run on the same process, so we have again a
>> single entity.
>>
>> Of course after this is done, and if we feel comfortable with the
>> solution, then we can go to the session mode.
>>
>> A second comment has to do with fault-tolerance in the per-job,
>> cluster-deploy mode.
>> In the document, it is suggested that upon recovery, the JobMaster of
>> each job re-creates the JobGraph.
>> I am just wondering if it is better to create and store the jobGraph
>> upon submission and only fetch it
>> upon recovery so that we have a static jobGraph.
>>
>> Finally, I have a question which is what happens with jobs that have
>> multiple execute calls?
>> The semantics seem to change compared to the current behaviour, right?
>>
>> Cheers,
>> Kostas
>>
>> On Wed, Jan 8, 2020 at 8:05 PM tison <[hidden email]> wrote:
>> >
>> > not always, Yang Wang is also not yet a committer but he can join the
>> > channel. I cannot find the id by clicking “Add new member in channel” so
>> > come to you and ask for try out the link. Possibly I will find other
>> ways
>> > but the original purpose is that the slack channel is a public area we
>> > discuss about developing...
>> > Best,
>> > tison.
>> >
>> >
>> > Peter Huang <[hidden email]> 于2020年1月9日周四 上午2:44写道:
>> >
>> > > Hi Tison,
>> > >
>> > > I am not the committer of Flink yet. I think I can't join it also.
>> > >
>> > >
>> > > Best Regards
>> > > Peter Huang
>> > >
>> > > On Wed, Jan 8, 2020 at 9:39 AM tison <[hidden email]> wrote:
>> > >
>> > > > Hi Peter,
>> > > >
>> > > > Could you try out this link?
>> > > https://the-asf.slack.com/messages/CNA3ADZPH
>> > > >
>> > > > Best,
>> > > > tison.
>> > > >
>> > > >
>> > > > Peter Huang <[hidden email]> 于2020年1月9日周四 上午1:22写道:
>> > > >
>> > > > > Hi Tison,
>> > > > >
>> > > > > I can't join the group with shared link. Would you please add me
>> into
>> > > the
>> > > > > group? My slack account is huangzhenqiu0825.
>> > > > > Thank you in advance.
>> > > > >
>> > > > >
>> > > > > Best Regards
>> > > > > Peter Huang
>> > > > >
>> > > > > On Wed, Jan 8, 2020 at 12:02 AM tison <[hidden email]>
>> wrote:
>> > > > >
>> > > > > > Hi Peter,
>> > > > > >
>> > > > > > As described above, this effort should get attention from people
>> > > > > developing
>> > > > > > FLIP-73 a.k.a. Executor abstractions. I recommend you to join
>> the
>> > > > public
>> > > > > > slack channel[1] for Flink Client API Enhancement and you can
>> try to
>> > > > > share
>> > > > > > you detailed thoughts there. It possibly gets more concrete
>> > > attentions.
>> > > > > >
>> > > > > > Best,
>> > > > > > tison.
>> > > > > >
>> > > > > > [1]
>> > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> https://slack.com/share/IS21SJ75H/Rk8HhUly9FuEHb7oGwBZ33uL/enQtODg2MDYwNjE5MTg3LTA2MjIzNDc1M2ZjZDVlMjdlZjk1M2RkYmJhNjAwMTk2ZDZkODQ4NmY5YmI4OGRhNWJkYTViMTM1NzlmMzc4OWM
>> > > > > >
>> > > > > >
>> > > > > > Peter Huang <[hidden email]> 于2020年1月7日周二 上午5:09写道:
>> > > > > >
>> > > > > > > Dear All,
>> > > > > > >
>> > > > > > > Happy new year! According to existing feedback from the
>> community,
>> > > we
>> > > > > > > revised the doc with the consideration of session cluster
>> support,
>> > > > and
>> > > > > > > concrete interface changes needed and execution plan. Please
>> take
>> > > one
>> > > > > > more
>> > > > > > > round of review at your most convenient time.
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> https://docs.google.com/document/d/1aAwVjdZByA-0CHbgv16Me-vjaaDMCfhX7TzVVTuifYM/edit#
>> > > > > > >
>> > > > > > >
>> > > > > > > Best Regards
>> > > > > > > Peter Huang
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > > On Thu, Jan 2, 2020 at 11:29 AM Peter Huang <
>> > > > > [hidden email]>
>> > > > > > > wrote:
>> > > > > > >
>> > > > > > > > Hi Dian,
>> > > > > > > > Thanks for giving us valuable feedbacks.
>> > > > > > > >
>> > > > > > > > 1) It's better to have a whole design for this feature
>> > > > > > > > For the suggestion of enabling the cluster mode also session
>> > > > > cluster, I
>> > > > > > > > think Flink already supported it. WebSubmissionExtension
>> already
>> > > > > allows
>> > > > > > > > users to start a job with the specified jar by using web UI.
>> > > > > > > > But we need to enable the feature from CLI for both local
>> jar,
>> > > > remote
>> > > > > > > jar.
>> > > > > > > > I will align with Yang Wang first about the details and
>> update
>> > > the
>> > > > > > design
>> > > > > > > > doc.
>> > > > > > > >
>> > > > > > > > 2) It's better to consider the convenience for users, such
>> as
>> > > > > debugging
>> > > > > > > >
>> > > > > > > > I am wondering whether we can store the exception in
>> jobgragh
>> > > > > > > > generation in application master. As no streaming graph can
>> be
>> > > > > > scheduled
>> > > > > > > in
>> > > > > > > > this case, there will be no more TM will be requested from
>> > > FlinkRM.
>> > > > > > > > If the AM is still running, users can still query it from
>> CLI. As
>> > > > it
>> > > > > > > > requires more change, we can get some feedback from <
>> > > > > > [hidden email]
>> > > > > > > >
>> > > > > > > > and @[hidden email] <[hidden email]>.
>> > > > > > > >
>> > > > > > > > 3) It's better to consider the impact to the stability of
>> the
>> > > > cluster
>> > > > > > > >
>> > > > > > > > I agree with Yang Wang's opinion.
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > Best Regards
>> > > > > > > > Peter Huang
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > On Sun, Dec 29, 2019 at 9:44 PM Dian Fu <
>> [hidden email]>
>> > > > > wrote:
>> > > > > > > >
>> > > > > > > >> Hi all,
>> > > > > > > >>
>> > > > > > > >> Sorry to jump into this discussion. Thanks everyone for the
>> > > > > > discussion.
>> > > > > > > >> I'm very interested in this topic although I'm not an
>> expert in
>> > > > this
>> > > > > > > part.
>> > > > > > > >> So I'm glad to share my thoughts as following:
>> > > > > > > >>
>> > > > > > > >> 1) It's better to have a whole design for this feature
>> > > > > > > >> As we know, there are two deployment modes: per-job mode
>> and
>> > > > session
>> > > > > > > >> mode. I'm wondering which mode really needs this feature.
>> As the
>> > > > > > design
>> > > > > > > doc
>> > > > > > > >> mentioned, per-job mode is more used for streaming jobs and
>> > > > session
>> > > > > > > mode is
>> > > > > > > >> usually used for batch jobs(Of course, the job types and
>> the
>> > > > > > deployment
>> > > > > > > >> modes are orthogonal). Usually streaming job is only
>> needed to
>> > > be
>> > > > > > > submitted
>> > > > > > > >> once and it will run for days or weeks, while batch jobs
>> will be
>> > > > > > > submitted
>> > > > > > > >> more frequently compared with streaming jobs. This means
>> that
>> > > > maybe
>> > > > > > > session
>> > > > > > > >> mode also needs this feature. However, if we support this
>> > > feature
>> > > > in
>> > > > > > > >> session mode, the application master will become the new
>> > > > centralized
>> > > > > > > >> service(which should be solved). So in this case, it's
>> better to
>> > > > > have
>> > > > > > a
>> > > > > > > >> complete design for both per-job mode and session mode.
>> > > > Furthermore,
>> > > > > > > even
>> > > > > > > >> if we can do it phase by phase, we need to have a whole
>> picture
>> > > of
>> > > > > how
>> > > > > > > it
>> > > > > > > >> works in both per-job mode and session mode.
>> > > > > > > >>
>> > > > > > > >> 2) It's better to consider the convenience for users, such
>> as
>> > > > > > debugging
>> > > > > > > >> After we finish this feature, the job graph will be
>> compiled in
>> > > > the
>> > > > > > > >> application master, which means that users cannot easily
>> get the
>> > > > > > > exception
>> > > > > > > >> message synchorousely in the job client if there are
>> problems
>> > > > during
>> > > > > > the
>> > > > > > > >> job graph compiling (especially for platform users), such
>> as the
>> > > > > > > resource
>> > > > > > > >> path is incorrect, the user program itself has some
>> problems,
>> > > etc.
>> > > > > > What
>> > > > > > > I'm
>> > > > > > > >> thinking is that maybe we should throw the exceptions as
>> early
>> > > as
>> > > > > > > possible
>> > > > > > > >> (during job submission stage).
>> > > > > > > >>
>> > > > > > > >> 3) It's better to consider the impact to the stability of
>> the
>> > > > > cluster
>> > > > > > > >> If we perform the compiling in the application master, we
>> should
>> > > > > > > consider
>> > > > > > > >> the impact of the compiling errors. Although YARN could
>> resume
>> > > the
>> > > > > > > >> application master in case of failures, but in some case
>> the
>> > > > > compiling
>> > > > > > > >> failure may be a waste of cluster resource and may impact
>> the
>> > > > > > stability
>> > > > > > > the
>> > > > > > > >> cluster and the other jobs in the cluster, such as the
>> resource
>> > > > path
>> > > > > > is
>> > > > > > > >> incorrect, the user program itself has some problems(in
>> this
>> > > case,
>> > > > > job
>> > > > > > > >> failover cannot solve this kind of problems) etc. In the
>> current
>> > > > > > > >> implemention, the compiling errors are handled in the
>> client
>> > > side
>> > > > > and
>> > > > > > > there
>> > > > > > > >> is no impact to the cluster at all.
>> > > > > > > >>
>> > > > > > > >> Regarding to 1), it's clearly pointed in the design doc
>> that
>> > > only
>> > > > > > > per-job
>> > > > > > > >> mode will be supported. However, I think it's better to
>> also
>> > > > > consider
>> > > > > > > the
>> > > > > > > >> session mode in the design doc.
>> > > > > > > >> Regarding to 2) and 3), I have not seen related sections
>> in the
>> > > > > design
>> > > > > > > >> doc. It will be good if we can cover them in the design
>> doc.
>> > > > > > > >>
>> > > > > > > >> Feel free to correct me If there is anything I
>> misunderstand.
>> > > > > > > >>
>> > > > > > > >> Regards,
>> > > > > > > >> Dian
>> > > > > > > >>
>> > > > > > > >>
>> > > > > > > >> > 在 2019年12月27日,上午3:13,Peter Huang <
>> [hidden email]>
>> > > > 写道:
>> > > > > > > >> >
>> > > > > > > >> > Hi Yang,
>> > > > > > > >> >
>> > > > > > > >> > I can't agree more. The effort definitely needs to align
>> with
>> > > > the
>> > > > > > > final
>> > > > > > > >> > goal of FLIP-73.
>> > > > > > > >> > I am thinking about whether we can achieve the goal with
>> two
>> > > > > phases.
>> > > > > > > >> >
>> > > > > > > >> > 1) Phase I
>> > > > > > > >> > As the CLiFrontend will not be depreciated soon. We can
>> still
>> > > > use
>> > > > > > the
>> > > > > > > >> > deployMode flag there,
>> > > > > > > >> > pass the program info through Flink configuration,  use
>> the
>> > > > > > > >> > ClassPathJobGraphRetriever
>> > > > > > > >> > to generate the job graph in ClusterEntrypoints of yarn
>> and
>> > > > > > > Kubernetes.
>> > > > > > > >> >
>> > > > > > > >> > 2) Phase II
>> > > > > > > >> > In  AbstractJobClusterExecutor, the job graph is
>> generated in
>> > > > the
>> > > > > > > >> execute
>> > > > > > > >> > function. We can still
>> > > > > > > >> > use the deployMode in it. With deployMode = cluster, the
>> > > execute
>> > > > > > > >> function
>> > > > > > > >> > only starts the cluster.
>> > > > > > > >> >
>> > > > > > > >> > When {Yarn/Kuberneates}PerJobClusterEntrypoint starts,
>> It will
>> > > > > start
>> > > > > > > the
>> > > > > > > >> > dispatch first, then we can use
>> > > > > > > >> > a ClusterEnvironment similar to ContextEnvironment to
>> submit
>> > > the
>> > > > > job
>> > > > > > > >> with
>> > > > > > > >> > jobName the local
>> > > > > > > >> > dispatcher. For the details, we need more investigation.
>> Let's
>> > > > > wait
>> > > > > > > >> > for @Aljoscha
>> > > > > > > >> > Krettek <[hidden email]> @Till Rohrmann <
>> > > > > [hidden email]
>> > > > > > >'s
>> > > > > > > >> > feedback after the holiday season.
>> > > > > > > >> >
>> > > > > > > >> > Thank you in advance. Merry Chrismas and Happy New
>> Year!!!
>> > > > > > > >> >
>> > > > > > > >> >
>> > > > > > > >> > Best Regards
>> > > > > > > >> > Peter Huang
>> > > > > > > >> >
>> > > > > > > >> >
>> > > > > > > >> >
>> > > > > > > >> >
>> > > > > > > >> >
>> > > > > > > >> >
>> > > > > > > >> >
>> > > > > > > >> >
>> > > > > > > >> > On Wed, Dec 25, 2019 at 1:08 AM Yang Wang <
>> > > > [hidden email]>
>> > > > > > > >> wrote:
>> > > > > > > >> >
>> > > > > > > >> >> Hi Peter,
>> > > > > > > >> >>
>> > > > > > > >> >> I think we need to reconsider tison's suggestion
>> seriously.
>> > > > After
>> > > > > > > >> FLIP-73,
>> > > > > > > >> >> the deployJobCluster has
>> > > > > > > >> >> beenmoved into `JobClusterExecutor#execute`. It should
>> not be
>> > > > > > > perceived
>> > > > > > > >> >> for `CliFrontend`. That
>> > > > > > > >> >> means the user program will *ALWAYS* be executed on
>> client
>> > > > side.
>> > > > > > This
>> > > > > > > >> is
>> > > > > > > >> >> the by design behavior.
>> > > > > > > >> >> So, we could not just add `if(client mode) .. else
>> if(cluster
>> > > > > mode)
>> > > > > > > >> ...`
>> > > > > > > >> >> codes in `CliFrontend` to bypass
>> > > > > > > >> >> the executor. We need to find a clean way to decouple
>> > > executing
>> > > > > > user
>> > > > > > > >> >> program and deploying per-job
>> > > > > > > >> >> cluster. Based on this, we could support to execute user
>> > > > program
>> > > > > on
>> > > > > > > >> client
>> > > > > > > >> >> or master side.
>> > > > > > > >> >>
>> > > > > > > >> >> Maybe Aljoscha and Jeff could give some good
>> suggestions.
>> > > > > > > >> >>
>> > > > > > > >> >>
>> > > > > > > >> >>
>> > > > > > > >> >> Best,
>> > > > > > > >> >> Yang
>> > > > > > > >> >>
>> > > > > > > >> >> Peter Huang <[hidden email]> 于2019年12月25日周三
>> > > > > 上午4:03写道:
>> > > > > > > >> >>
>> > > > > > > >> >>> Hi Jingjing,
>> > > > > > > >> >>>
>> > > > > > > >> >>> The improvement proposed is a deployment option for
>> CLI. For
>> > > > SQL
>> > > > > > > based
>> > > > > > > >> >>> Flink application, It is more convenient to use the
>> existing
>> > > > > model
>> > > > > > > in
>> > > > > > > >> >>> SqlClient in which
>> > > > > > > >> >>> the job graph is generated within SqlClient. After
>> adding
>> > > the
>> > > > > > > delayed
>> > > > > > > >> job
>> > > > > > > >> >>> graph generation, I think there is no change is needed
>> for
>> > > > your
>> > > > > > > side.
>> > > > > > > >> >>>
>> > > > > > > >> >>>
>> > > > > > > >> >>> Best Regards
>> > > > > > > >> >>> Peter Huang
>> > > > > > > >> >>>
>> > > > > > > >> >>>
>> > > > > > > >> >>> On Wed, Dec 18, 2019 at 6:01 AM jingjing bai <
>> > > > > > > >> [hidden email]>
>> > > > > > > >> >>> wrote:
>> > > > > > > >> >>>
>> > > > > > > >> >>>> hi peter:
>> > > > > > > >> >>>>    we had extension SqlClent to support sql job
>> submit in
>> > > web
>> > > > > > base
>> > > > > > > on
>> > > > > > > >> >>>> flink 1.9.   we support submit to yarn on per job
>> mode too.
>> > > > > > > >> >>>>    in this case, the job graph generated  on client
>> side
>> > > .  I
>> > > > > > think
>> > > > > > > >> >>> this
>> > > > > > > >> >>>> discuss Mainly to improve api programme.  but in my
>> case ,
>> > > > > there
>> > > > > > is
>> > > > > > > >> no
>> > > > > > > >> >>>> jar to upload but only a sql string .
>> > > > > > > >> >>>>    do u had more suggestion to improve for sql mode
>> or it
>> > > is
>> > > > > > only a
>> > > > > > > >> >>>> switch for api programme?
>> > > > > > > >> >>>>
>> > > > > > > >> >>>>
>> > > > > > > >> >>>> best
>> > > > > > > >> >>>> bai jj
>> > > > > > > >> >>>>
>> > > > > > > >> >>>>
>> > > > > > > >> >>>> Yang Wang <[hidden email]> 于2019年12月18日周三
>> 下午7:21写道:
>> > > > > > > >> >>>>
>> > > > > > > >> >>>>> I just want to revive this discussion.
>> > > > > > > >> >>>>>
>> > > > > > > >> >>>>> Recently, i am thinking about how to natively run
>> flink
>> > > > > per-job
>> > > > > > > >> >>> cluster on
>> > > > > > > >> >>>>> Kubernetes.
>> > > > > > > >> >>>>> The per-job mode on Kubernetes is very different
>> from on
>> > > > Yarn.
>> > > > > > And
>> > > > > > > >> we
>> > > > > > > >> >>> will
>> > > > > > > >> >>>>> have
>> > > > > > > >> >>>>> the same deployment requirements to the client and
>> entry
>> > > > > point.
>> > > > > > > >> >>>>>
>> > > > > > > >> >>>>> 1. Flink client not always need a local jar to start
>> a
>> > > Flink
>> > > > > > > per-job
>> > > > > > > >> >>>>> cluster. We could
>> > > > > > > >> >>>>> support multiple schemas. For example,
>> > > > file:///path/of/my.jar
>> > > > > > > means
>> > > > > > > >> a
>> > > > > > > >> >>> jar
>> > > > > > > >> >>>>> located
>> > > > > > > >> >>>>> at client side,
>> hdfs://myhdfs/user/myname/flink/my.jar
>> > > > means a
>> > > > > > jar
>> > > > > > > >> >>> located
>> > > > > > > >> >>>>> at
>> > > > > > > >> >>>>> remote hdfs, local:///path/in/image/my.jar means a
>> jar
>> > > > located
>> > > > > > at
>> > > > > > > >> >>>>> jobmanager side.
>> > > > > > > >> >>>>>
>> > > > > > > >> >>>>> 2. Support running user program on master side. This
>> also
>> > > > > means
>> > > > > > > the
>> > > > > > > >> >>> entry
>> > > > > > > >> >>>>> point
>> > > > > > > >> >>>>> will generate the job graph on master side. We could
>> use
>> > > the
>> > > > > > > >> >>>>> ClasspathJobGraphRetriever
>> > > > > > > >> >>>>> or start a local Flink client to achieve this
>> purpose.
>> > > > > > > >> >>>>>
>> > > > > > > >> >>>>>
>> > > > > > > >> >>>>> cc tison, Aljoscha & Kostas Do you think this is the
>> right
>> > > > > > > >> direction we
>> > > > > > > >> >>>>> need to work?
>> > > > > > > >> >>>>>
>> > > > > > > >> >>>>> tison <[hidden email]> 于2019年12月12日周四
>> 下午4:48写道:
>> > > > > > > >> >>>>>
>> > > > > > > >> >>>>>> A quick idea is that we separate the deployment
>> from user
>> > > > > > program
>> > > > > > > >> >>> that
>> > > > > > > >> >>>>> it
>> > > > > > > >> >>>>>> has always been done
>> > > > > > > >> >>>>>> outside the program. On user program executed there
>> is
>> > > > > always a
>> > > > > > > >> >>>>>> ClusterClient that communicates with
>> > > > > > > >> >>>>>> an existing cluster, remote or local. It will be
>> another
>> > > > > thread
>> > > > > > > so
>> > > > > > > >> >>> just
>> > > > > > > >> >>>>> for
>> > > > > > > >> >>>>>> your information.
>> > > > > > > >> >>>>>>
>> > > > > > > >> >>>>>> Best,
>> > > > > > > >> >>>>>> tison.
>> > > > > > > >> >>>>>>
>> > > > > > > >> >>>>>>
>> > > > > > > >> >>>>>> tison <[hidden email]> 于2019年12月12日周四
>> 下午4:40写道:
>> > > > > > > >> >>>>>>
>> > > > > > > >> >>>>>>> Hi Peter,
>> > > > > > > >> >>>>>>>
>> > > > > > > >> >>>>>>> Another concern I realized recently is that with
>> current
>> > > > > > > Executors
>> > > > > > > >> >>>>>>> abstraction(FLIP-73)
>> > > > > > > >> >>>>>>> I'm afraid that user program is designed to ALWAYS
>> run
>> > > on
>> > > > > the
>> > > > > > > >> >>> client
>> > > > > > > >> >>>>>> side.
>> > > > > > > >> >>>>>>> Specifically,
>> > > > > > > >> >>>>>>> we deploy the job in executor when env.execute
>> called.
>> > > > This
>> > > > > > > >> >>>>> abstraction
>> > > > > > > >> >>>>>>> possibly prevents
>> > > > > > > >> >>>>>>> Flink runs user program on the cluster side.
>> > > > > > > >> >>>>>>>
>> > > > > > > >> >>>>>>> For your proposal, in this case we already
>> compiled the
>> > > > > > program
>> > > > > > > >> and
>> > > > > > > >> >>>>> run
>> > > > > > > >> >>>>>> on
>> > > > > > > >> >>>>>>> the client side,
>> > > > > > > >> >>>>>>> even we deploy a cluster and retrieve job graph
>> from
>> > > > program
>> > > > > > > >> >>>>> metadata, it
>> > > > > > > >> >>>>>>> doesn't make
>> > > > > > > >> >>>>>>> many sense.
>> > > > > > > >> >>>>>>>
>> > > > > > > >> >>>>>>> cc Aljoscha & Kostas what do you think about this
>> > > > > constraint?
>> > > > > > > >> >>>>>>>
>> > > > > > > >> >>>>>>> Best,
>> > > > > > > >> >>>>>>> tison.
>> > > > > > > >> >>>>>>>
>> > > > > > > >> >>>>>>>
>> > > > > > > >> >>>>>>> Peter Huang <[hidden email]>
>> 于2019年12月10日周二
>> > > > > > > >> 下午12:45写道:
>> > > > > > > >> >>>>>>>
>> > > > > > > >> >>>>>>>> Hi Tison,
>> > > > > > > >> >>>>>>>>
>> > > > > > > >> >>>>>>>> Yes, you are right. I think I made the wrong
>> argument
>> > > in
>> > > > > the
>> > > > > > > doc.
>> > > > > > > >> >>>>>>>> Basically, the packaging jar problem is only for
>> > > platform
>> > > > > > > users.
>> > > > > > > >> >>> In
>> > > > > > > >> >>>>> our
>> > > > > > > >> >>>>>>>> internal deploy service,
>> > > > > > > >> >>>>>>>> we further optimized the deployment latency by
>> letting
>> > > > > users
>> > > > > > to
>> > > > > > > >> >>>>>> packaging
>> > > > > > > >> >>>>>>>> flink-runtime together with the uber jar, so that
>> we
>> > > > don't
>> > > > > > need
>> > > > > > > >> to
>> > > > > > > >> >>>>>>>> consider
>> > > > > > > >> >>>>>>>> multiple flink version
>> > > > > > > >> >>>>>>>> support for now. In the session client mode, as
>> Flink
>> > > > libs
>> > > > > > will
>> > > > > > > >> be
>> > > > > > > >> >>>>>> shipped
>> > > > > > > >> >>>>>>>> anyway as local resources of yarn. Users actually
>> don't
>> > > > > need
>> > > > > > to
>> > > > > > > >> >>>>> package
>> > > > > > > >> >>>>>>>> those libs into job jar.
>> > > > > > > >> >>>>>>>>
>> > > > > > > >> >>>>>>>>
>> > > > > > > >> >>>>>>>>
>> > > > > > > >> >>>>>>>> Best Regards
>> > > > > > > >> >>>>>>>> Peter Huang
>> > > > > > > >> >>>>>>>>
>> > > > > > > >> >>>>>>>> On Mon, Dec 9, 2019 at 8:35 PM tison <
>> > > > [hidden email]
>> > > > > >
>> > > > > > > >> >>> wrote:
>> > > > > > > >> >>>>>>>>
>> > > > > > > >> >>>>>>>>>> 3. What do you mean about the package? Do users
>> need
>> > > to
>> > > > > > > >> >>> compile
>> > > > > > > >> >>>>>> their
>> > > > > > > >> >>>>>>>>> jars
>> > > > > > > >> >>>>>>>>> inlcuding flink-clients, flink-optimizer,
>> flink-table
>> > > > > codes?
>> > > > > > > >> >>>>>>>>>
>> > > > > > > >> >>>>>>>>> The answer should be no because they exist in
>> system
>> > > > > > > classpath.
>> > > > > > > >> >>>>>>>>>
>> > > > > > > >> >>>>>>>>> Best,
>> > > > > > > >> >>>>>>>>> tison.
>> > > > > > > >> >>>>>>>>>
>> > > > > > > >> >>>>>>>>>
>> > > > > > > >> >>>>>>>>> Yang Wang <[hidden email]> 于2019年12月10日周二
>> > > > > 下午12:18写道:
>> > > > > > > >> >>>>>>>>>
>> > > > > > > >> >>>>>>>>>> Hi Peter,
>> > > > > > > >> >>>>>>>>>>
>> > > > > > > >> >>>>>>>>>> Thanks a lot for starting this discussion. I
>> think
>> > > this
>> > > > > is
>> > > > > > a
>> > > > > > > >> >>> very
>> > > > > > > >> >>>>>>>> useful
>> > > > > > > >> >>>>>>>>>> feature.
>> > > > > > > >> >>>>>>>>>>
>> > > > > > > >> >>>>>>>>>> Not only for Yarn, i am focused on flink on
>> > > Kubernetes
>> > > > > > > >> >>>>> integration
>> > > > > > > >> >>>>>> and
>> > > > > > > >> >>>>>>>>> come
>> > > > > > > >> >>>>>>>>>> across the same
>> > > > > > > >> >>>>>>>>>> problem. I do not want the job graph generated
>> on
>> > > > client
>> > > > > > > side.
>> > > > > > > >> >>>>>>>> Instead,
>> > > > > > > >> >>>>>>>>> the
>> > > > > > > >> >>>>>>>>>> user jars are built in
>> > > > > > > >> >>>>>>>>>> a user-defined image. When the job manager
>> launched,
>> > > we
>> > > > > > just
>> > > > > > > >> >>>>> need to
>> > > > > > > >> >>>>>>>>>> generate the job graph
>> > > > > > > >> >>>>>>>>>> based on local user jars.
>> > > > > > > >> >>>>>>>>>>
>> > > > > > > >> >>>>>>>>>> I have some small suggestion about this.
>> > > > > > > >> >>>>>>>>>>
>> > > > > > > >> >>>>>>>>>> 1. `ProgramJobGraphRetriever` is very similar to
>> > > > > > > >> >>>>>>>>>> `ClasspathJobGraphRetriever`, the differences
>> > > > > > > >> >>>>>>>>>> are the former needs `ProgramMetadata` and the
>> latter
>> > > > > needs
>> > > > > > > >> >>> some
>> > > > > > > >> >>>>>>>>> arguments.
>> > > > > > > >> >>>>>>>>>> Is it possible to
>> > > > > > > >> >>>>>>>>>> have an unified `JobGraphRetriever` to support
>> both?
>> > > > > > > >> >>>>>>>>>> 2. Is it possible to not use a local user jar to
>> > > start
>> > > > a
>> > > > > > > >> >>> per-job
>> > > > > > > >> >>>>>>>> cluster?
>> > > > > > > >> >>>>>>>>>> In your case, the user jars has
>> > > > > > > >> >>>>>>>>>> existed on hdfs already and we do need to
>> download
>> > > the
>> > > > > jars
>> > > > > > > to
>> > > > > > > >> >>>>>>>> deployer
>> > > > > > > >> >>>>>>>>>> service. Currently, we
>> > > > > > > >> >>>>>>>>>> always need a local user jar to start a flink
>> > > cluster.
>> > > > It
>> > > > > > is
>> > > > > > > >> >>> be
>> > > > > > > >> >>>>>> great
>> > > > > > > >> >>>>>>>> if
>> > > > > > > >> >>>>>>>>> we
>> > > > > > > >> >>>>>>>>>> could support remote user jars.
>> > > > > > > >> >>>>>>>>>>>> In the implementation, we assume users package
>> > > > > > > >> >>> flink-clients,
>> > > > > > > >> >>>>>>>>>> flink-optimizer, flink-table together within
>> the job
>> > > > jar.
>> > > > > > > >> >>>>> Otherwise,
>> > > > > > > >> >>>>>>>> the
>> > > > > > > >> >>>>>>>>>> job graph generation within
>> JobClusterEntryPoint will
>> > > > > fail.
>> > > > > > > >> >>>>>>>>>> 3. What do you mean about the package? Do users
>> need
>> > > to
>> > > > > > > >> >>> compile
>> > > > > > > >> >>>>>> their
>> > > > > > > >> >>>>>>>>> jars
>> > > > > > > >> >>>>>>>>>> inlcuding flink-clients, flink-optimizer,
>> flink-table
>> > > > > > codes?
>> > > > > > > >> >>>>>>>>>>
>> > > > > > > >> >>>>>>>>>>
>> > > > > > > >> >>>>>>>>>>
>> > > > > > > >> >>>>>>>>>> Best,
>> > > > > > > >> >>>>>>>>>> Yang
>> > > > > > > >> >>>>>>>>>>
>> > > > > > > >> >>>>>>>>>> Peter Huang <[hidden email]>
>> > > > 于2019年12月10日周二
>> > > > > > > >> >>>>> 上午2:37写道:
>> > > > > > > >> >>>>>>>>>>
>> > > > > > > >> >>>>>>>>>>> Dear All,
>> > > > > > > >> >>>>>>>>>>>
>> > > > > > > >> >>>>>>>>>>> Recently, the Flink community starts to
>> improve the
>> > > > yarn
>> > > > > > > >> >>>>> cluster
>> > > > > > > >> >>>>>>>>>> descriptor
>> > > > > > > >> >>>>>>>>>>> to make job jar and config files configurable
>> from
>> > > > CLI.
>> > > > > It
>> > > > > > > >> >>>>>> improves
>> > > > > > > >> >>>>>>>> the
>> > > > > > > >> >>>>>>>>>>> flexibility of  Flink deployment Yarn Per Job
>> Mode.
>> > > > For
>> > > > > > > >> >>>>> platform
>> > > > > > > >> >>>>>>>> users
>> > > > > > > >> >>>>>>>>>> who
>> > > > > > > >> >>>>>>>>>>> manage tens of hundreds of streaming pipelines
>> for
>> > > the
>> > > > > > whole
>> > > > > > > >> >>>>> org
>> > > > > > > >> >>>>>> or
>> > > > > > > >> >>>>>>>>>>> company, we found the job graph generation in
>> > > > > client-side
>> > > > > > is
>> > > > > > > >> >>>>>> another
>> > > > > > > >> >>>>>>>>>>> pinpoint. Thus, we want to propose a
>> configurable
>> > > > > feature
>> > > > > > > >> >>> for
>> > > > > > > >> >>>>>>>>>>> FlinkYarnSessionCli. The feature can allow
>> users to
>> > > > > choose
>> > > > > > > >> >>> the
>> > > > > > > >> >>>>> job
>> > > > > > > >> >>>>>>>>> graph
>> > > > > > > >> >>>>>>>>>>> generation in Flink ClusterEntryPoint so that
>> the
>> > > job
>> > > > > jar
>> > > > > > > >> >>>>> doesn't
>> > > > > > > >> >>>>>>>> need
>> > > > > > > >> >>>>>>>>> to
>> > > > > > > >> >>>>>>>>>>> be locally for the job graph generation. The
>> > > proposal
>> > > > is
>> > > > > > > >> >>>>> organized
>> > > > > > > >> >>>>>>>> as a
>> > > > > > > >> >>>>>>>>>>> FLIP
>> > > > > > > >> >>>>>>>>>>>
>> > > > > > > >> >>>>>>>>>>>
>> > > > > > > >> >>>>>>>>>>
>> > > > > > > >> >>>>>>>>>
>> > > > > > > >> >>>>>>>>
>> > > > > > > >> >>>>>>
>> > > > > > > >> >>>>>
>> > > > > > > >> >>>
>> > > > > > > >>
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-85+Delayed+JobGraph+Generation
>> > > > > > > >> >>>>>>>>>>> .
>> > > > > > > >> >>>>>>>>>>>
>> > > > > > > >> >>>>>>>>>>> Any questions and suggestions are welcomed.
>> Thank
>> > > you
>> > > > in
>> > > > > > > >> >>>>> advance.
>> > > > > > > >> >>>>>>>>>>>
>> > > > > > > >> >>>>>>>>>>>
>> > > > > > > >> >>>>>>>>>>> Best Regards
>> > > > > > > >> >>>>>>>>>>> Peter Huang
>> > > > > > > >> >>>>>>>>>>>
>> > > > > > > >> >>>>>>>>>>
>> > > > > > > >> >>>>>>>>>
>> > > > > > > >> >>>>>>>>
>> > > > > > > >> >>>>>>>
>> > > > > > > >> >>>>>>
>> > > > > > > >> >>>>>
>> > > > > > > >> >>>>
>> > > > > > > >> >>>
>> > > > > > > >> >>
>> > > > > > > >>
>> > > > > > > >>
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>>
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-85: Delayed Job Graph Generation

Kostas Kloudas-4
Hi all,

I update https://cwiki.apache.org/confluence/display/FLINK/FLIP-85+Flink+Application+Mode
based on the discussion we had here:

https://docs.google.com/document/d/1ji72s3FD9DYUyGuKnJoO4ApzV-nSsZa0-bceGXW7Ocw/edit#

Please let me know what you think and please keep the discussion in the ML :)

Thanks for starting the discussion and I hope that soon we will be
able to vote on the FLIP.

Cheers,
Kostas

On Thu, Jan 16, 2020 at 3:40 AM Yang Wang <[hidden email]> wrote:

>
> Hi all,
>
> Thanks a lot for the feedback from @Kostas Kloudas. Your all concerns are
> on point. The FLIP-85 is mainly
> focused on supporting cluster mode for per-job. Since it is more urgent and
> have much more use
> cases both in Yarn and Kubernetes deployment. For session cluster, we could
> have more discussion
> in a new thread later.
>
> #1, How to download the user jars and dependencies for per-job in cluster
> mode?
> For Yarn, we could register the user jars and dependencies as
> LocalResource. They will be distributed
> by Yarn. And once the JobManager and TaskManager launched, the jars are
> already exists.
> For Standalone per-job and K8s, we expect that the user jars
> and dependencies are built into the image.
> Or the InitContainer could be used for downloading. It is natively
> distributed and we will not have bottleneck.
>
> #2, Job graph recovery
> We could have an optimization to store job graph on the DFS. However, i
> suggest building a new jobgraph
> from the configuration is the default option. Since we will not always have
> a DFS store when deploying a
> Flink per-job cluster. Of course, we assume that using the same
> configuration(e.g. job_id, user_jar, main_class,
> main_args, parallelism, savepoint_settings, etc.) will get a same job
> graph. I think the standalone per-job
> already has the similar behavior.
>
> #3, What happens with jobs that have multiple execute calls?
> Currently, it is really a problem. Even we use a local client on Flink
> master side, it will have different behavior with
> client mode. For client mode, if we execute multiple times, then we will
> deploy multiple Flink clusters for each execute.
> I am not pretty sure whether it is reasonable. However, i still think using
> the local client is a good choice. We could
> continue the discussion in a new thread. @Zili Chen <[hidden email]> Do
> you want to drive this?
>
>
>
> Best,
> Yang
>
> Peter Huang <[hidden email]> 于2020年1月16日周四 上午1:55写道:
>
> > Hi Kostas,
> >
> > Thanks for this feedback. I can't agree more about the opinion. The
> > cluster mode should be added
> > first in per job cluster.
> >
> > 1) For job cluster implementation
> > 1. Job graph recovery from configuration or store as static job graph as
> > session cluster. I think the static one will be better for less recovery
> > time.
> > Let me update the doc for details.
> >
> > 2. For job execute multiple times, I think @Zili Chen
> > <[hidden email]> has proposed the local client solution that can
> > the run program actually in the cluster entry point. We can put the
> > implementation in the second stage,
> > or even a new FLIP for further discussion.
> >
> > 2) For session cluster implementation
> > We can disable the cluster mode for the session cluster in the first
> > stage. I agree the jar downloading will be a painful thing.
> > We can consider about PoC and performance evaluation first. If the end to
> > end experience is good enough, then we can consider
> > proceeding with the solution.
> >
> > Looking forward to more opinions from @Yang Wang <[hidden email]> @Zili
> > Chen <[hidden email]> @Dian Fu <[hidden email]>.
> >
> >
> > Best Regards
> > Peter Huang
> >
> > On Wed, Jan 15, 2020 at 7:50 AM Kostas Kloudas <[hidden email]> wrote:
> >
> >> Hi all,
> >>
> >> I am writing here as the discussion on the Google Doc seems to be a
> >> bit difficult to follow.
> >>
> >> I think that in order to be able to make progress, it would be helpful
> >> to focus on per-job mode for now.
> >> The reason is that:
> >>  1) making the (unique) JobSubmitHandler responsible for creating the
> >> jobgraphs,
> >>   which includes downloading dependencies, is not an optimal solution
> >>  2) even if we put the responsibility on the JobMaster, currently each
> >> job has its own
> >>   JobMaster but they all run on the same process, so we have again a
> >> single entity.
> >>
> >> Of course after this is done, and if we feel comfortable with the
> >> solution, then we can go to the session mode.
> >>
> >> A second comment has to do with fault-tolerance in the per-job,
> >> cluster-deploy mode.
> >> In the document, it is suggested that upon recovery, the JobMaster of
> >> each job re-creates the JobGraph.
> >> I am just wondering if it is better to create and store the jobGraph
> >> upon submission and only fetch it
> >> upon recovery so that we have a static jobGraph.
> >>
> >> Finally, I have a question which is what happens with jobs that have
> >> multiple execute calls?
> >> The semantics seem to change compared to the current behaviour, right?
> >>
> >> Cheers,
> >> Kostas
> >>
> >> On Wed, Jan 8, 2020 at 8:05 PM tison <[hidden email]> wrote:
> >> >
> >> > not always, Yang Wang is also not yet a committer but he can join the
> >> > channel. I cannot find the id by clicking “Add new member in channel” so
> >> > come to you and ask for try out the link. Possibly I will find other
> >> ways
> >> > but the original purpose is that the slack channel is a public area we
> >> > discuss about developing...
> >> > Best,
> >> > tison.
> >> >
> >> >
> >> > Peter Huang <[hidden email]> 于2020年1月9日周四 上午2:44写道:
> >> >
> >> > > Hi Tison,
> >> > >
> >> > > I am not the committer of Flink yet. I think I can't join it also.
> >> > >
> >> > >
> >> > > Best Regards
> >> > > Peter Huang
> >> > >
> >> > > On Wed, Jan 8, 2020 at 9:39 AM tison <[hidden email]> wrote:
> >> > >
> >> > > > Hi Peter,
> >> > > >
> >> > > > Could you try out this link?
> >> > > https://the-asf.slack.com/messages/CNA3ADZPH
> >> > > >
> >> > > > Best,
> >> > > > tison.
> >> > > >
> >> > > >
> >> > > > Peter Huang <[hidden email]> 于2020年1月9日周四 上午1:22写道:
> >> > > >
> >> > > > > Hi Tison,
> >> > > > >
> >> > > > > I can't join the group with shared link. Would you please add me
> >> into
> >> > > the
> >> > > > > group? My slack account is huangzhenqiu0825.
> >> > > > > Thank you in advance.
> >> > > > >
> >> > > > >
> >> > > > > Best Regards
> >> > > > > Peter Huang
> >> > > > >
> >> > > > > On Wed, Jan 8, 2020 at 12:02 AM tison <[hidden email]>
> >> wrote:
> >> > > > >
> >> > > > > > Hi Peter,
> >> > > > > >
> >> > > > > > As described above, this effort should get attention from people
> >> > > > > developing
> >> > > > > > FLIP-73 a.k.a. Executor abstractions. I recommend you to join
> >> the
> >> > > > public
> >> > > > > > slack channel[1] for Flink Client API Enhancement and you can
> >> try to
> >> > > > > share
> >> > > > > > you detailed thoughts there. It possibly gets more concrete
> >> > > attentions.
> >> > > > > >
> >> > > > > > Best,
> >> > > > > > tison.
> >> > > > > >
> >> > > > > > [1]
> >> > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> https://slack.com/share/IS21SJ75H/Rk8HhUly9FuEHb7oGwBZ33uL/enQtODg2MDYwNjE5MTg3LTA2MjIzNDc1M2ZjZDVlMjdlZjk1M2RkYmJhNjAwMTk2ZDZkODQ4NmY5YmI4OGRhNWJkYTViMTM1NzlmMzc4OWM
> >> > > > > >
> >> > > > > >
> >> > > > > > Peter Huang <[hidden email]> 于2020年1月7日周二 上午5:09写道:
> >> > > > > >
> >> > > > > > > Dear All,
> >> > > > > > >
> >> > > > > > > Happy new year! According to existing feedback from the
> >> community,
> >> > > we
> >> > > > > > > revised the doc with the consideration of session cluster
> >> support,
> >> > > > and
> >> > > > > > > concrete interface changes needed and execution plan. Please
> >> take
> >> > > one
> >> > > > > > more
> >> > > > > > > round of review at your most convenient time.
> >> > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> https://docs.google.com/document/d/1aAwVjdZByA-0CHbgv16Me-vjaaDMCfhX7TzVVTuifYM/edit#
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > Best Regards
> >> > > > > > > Peter Huang
> >> > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > On Thu, Jan 2, 2020 at 11:29 AM Peter Huang <
> >> > > > > [hidden email]>
> >> > > > > > > wrote:
> >> > > > > > >
> >> > > > > > > > Hi Dian,
> >> > > > > > > > Thanks for giving us valuable feedbacks.
> >> > > > > > > >
> >> > > > > > > > 1) It's better to have a whole design for this feature
> >> > > > > > > > For the suggestion of enabling the cluster mode also session
> >> > > > > cluster, I
> >> > > > > > > > think Flink already supported it. WebSubmissionExtension
> >> already
> >> > > > > allows
> >> > > > > > > > users to start a job with the specified jar by using web UI.
> >> > > > > > > > But we need to enable the feature from CLI for both local
> >> jar,
> >> > > > remote
> >> > > > > > > jar.
> >> > > > > > > > I will align with Yang Wang first about the details and
> >> update
> >> > > the
> >> > > > > > design
> >> > > > > > > > doc.
> >> > > > > > > >
> >> > > > > > > > 2) It's better to consider the convenience for users, such
> >> as
> >> > > > > debugging
> >> > > > > > > >
> >> > > > > > > > I am wondering whether we can store the exception in
> >> jobgragh
> >> > > > > > > > generation in application master. As no streaming graph can
> >> be
> >> > > > > > scheduled
> >> > > > > > > in
> >> > > > > > > > this case, there will be no more TM will be requested from
> >> > > FlinkRM.
> >> > > > > > > > If the AM is still running, users can still query it from
> >> CLI. As
> >> > > > it
> >> > > > > > > > requires more change, we can get some feedback from <
> >> > > > > > [hidden email]
> >> > > > > > > >
> >> > > > > > > > and @[hidden email] <[hidden email]>.
> >> > > > > > > >
> >> > > > > > > > 3) It's better to consider the impact to the stability of
> >> the
> >> > > > cluster
> >> > > > > > > >
> >> > > > > > > > I agree with Yang Wang's opinion.
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > Best Regards
> >> > > > > > > > Peter Huang
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > On Sun, Dec 29, 2019 at 9:44 PM Dian Fu <
> >> [hidden email]>
> >> > > > > wrote:
> >> > > > > > > >
> >> > > > > > > >> Hi all,
> >> > > > > > > >>
> >> > > > > > > >> Sorry to jump into this discussion. Thanks everyone for the
> >> > > > > > discussion.
> >> > > > > > > >> I'm very interested in this topic although I'm not an
> >> expert in
> >> > > > this
> >> > > > > > > part.
> >> > > > > > > >> So I'm glad to share my thoughts as following:
> >> > > > > > > >>
> >> > > > > > > >> 1) It's better to have a whole design for this feature
> >> > > > > > > >> As we know, there are two deployment modes: per-job mode
> >> and
> >> > > > session
> >> > > > > > > >> mode. I'm wondering which mode really needs this feature.
> >> As the
> >> > > > > > design
> >> > > > > > > doc
> >> > > > > > > >> mentioned, per-job mode is more used for streaming jobs and
> >> > > > session
> >> > > > > > > mode is
> >> > > > > > > >> usually used for batch jobs(Of course, the job types and
> >> the
> >> > > > > > deployment
> >> > > > > > > >> modes are orthogonal). Usually streaming job is only
> >> needed to
> >> > > be
> >> > > > > > > submitted
> >> > > > > > > >> once and it will run for days or weeks, while batch jobs
> >> will be
> >> > > > > > > submitted
> >> > > > > > > >> more frequently compared with streaming jobs. This means
> >> that
> >> > > > maybe
> >> > > > > > > session
> >> > > > > > > >> mode also needs this feature. However, if we support this
> >> > > feature
> >> > > > in
> >> > > > > > > >> session mode, the application master will become the new
> >> > > > centralized
> >> > > > > > > >> service(which should be solved). So in this case, it's
> >> better to
> >> > > > > have
> >> > > > > > a
> >> > > > > > > >> complete design for both per-job mode and session mode.
> >> > > > Furthermore,
> >> > > > > > > even
> >> > > > > > > >> if we can do it phase by phase, we need to have a whole
> >> picture
> >> > > of
> >> > > > > how
> >> > > > > > > it
> >> > > > > > > >> works in both per-job mode and session mode.
> >> > > > > > > >>
> >> > > > > > > >> 2) It's better to consider the convenience for users, such
> >> as
> >> > > > > > debugging
> >> > > > > > > >> After we finish this feature, the job graph will be
> >> compiled in
> >> > > > the
> >> > > > > > > >> application master, which means that users cannot easily
> >> get the
> >> > > > > > > exception
> >> > > > > > > >> message synchorousely in the job client if there are
> >> problems
> >> > > > during
> >> > > > > > the
> >> > > > > > > >> job graph compiling (especially for platform users), such
> >> as the
> >> > > > > > > resource
> >> > > > > > > >> path is incorrect, the user program itself has some
> >> problems,
> >> > > etc.
> >> > > > > > What
> >> > > > > > > I'm
> >> > > > > > > >> thinking is that maybe we should throw the exceptions as
> >> early
> >> > > as
> >> > > > > > > possible
> >> > > > > > > >> (during job submission stage).
> >> > > > > > > >>
> >> > > > > > > >> 3) It's better to consider the impact to the stability of
> >> the
> >> > > > > cluster
> >> > > > > > > >> If we perform the compiling in the application master, we
> >> should
> >> > > > > > > consider
> >> > > > > > > >> the impact of the compiling errors. Although YARN could
> >> resume
> >> > > the
> >> > > > > > > >> application master in case of failures, but in some case
> >> the
> >> > > > > compiling
> >> > > > > > > >> failure may be a waste of cluster resource and may impact
> >> the
> >> > > > > > stability
> >> > > > > > > the
> >> > > > > > > >> cluster and the other jobs in the cluster, such as the
> >> resource
> >> > > > path
> >> > > > > > is
> >> > > > > > > >> incorrect, the user program itself has some problems(in
> >> this
> >> > > case,
> >> > > > > job
> >> > > > > > > >> failover cannot solve this kind of problems) etc. In the
> >> current
> >> > > > > > > >> implemention, the compiling errors are handled in the
> >> client
> >> > > side
> >> > > > > and
> >> > > > > > > there
> >> > > > > > > >> is no impact to the cluster at all.
> >> > > > > > > >>
> >> > > > > > > >> Regarding to 1), it's clearly pointed in the design doc
> >> that
> >> > > only
> >> > > > > > > per-job
> >> > > > > > > >> mode will be supported. However, I think it's better to
> >> also
> >> > > > > consider
> >> > > > > > > the
> >> > > > > > > >> session mode in the design doc.
> >> > > > > > > >> Regarding to 2) and 3), I have not seen related sections
> >> in the
> >> > > > > design
> >> > > > > > > >> doc. It will be good if we can cover them in the design
> >> doc.
> >> > > > > > > >>
> >> > > > > > > >> Feel free to correct me If there is anything I
> >> misunderstand.
> >> > > > > > > >>
> >> > > > > > > >> Regards,
> >> > > > > > > >> Dian
> >> > > > > > > >>
> >> > > > > > > >>
> >> > > > > > > >> > 在 2019年12月27日,上午3:13,Peter Huang <
> >> [hidden email]>
> >> > > > 写道:
> >> > > > > > > >> >
> >> > > > > > > >> > Hi Yang,
> >> > > > > > > >> >
> >> > > > > > > >> > I can't agree more. The effort definitely needs to align
> >> with
> >> > > > the
> >> > > > > > > final
> >> > > > > > > >> > goal of FLIP-73.
> >> > > > > > > >> > I am thinking about whether we can achieve the goal with
> >> two
> >> > > > > phases.
> >> > > > > > > >> >
> >> > > > > > > >> > 1) Phase I
> >> > > > > > > >> > As the CLiFrontend will not be depreciated soon. We can
> >> still
> >> > > > use
> >> > > > > > the
> >> > > > > > > >> > deployMode flag there,
> >> > > > > > > >> > pass the program info through Flink configuration,  use
> >> the
> >> > > > > > > >> > ClassPathJobGraphRetriever
> >> > > > > > > >> > to generate the job graph in ClusterEntrypoints of yarn
> >> and
> >> > > > > > > Kubernetes.
> >> > > > > > > >> >
> >> > > > > > > >> > 2) Phase II
> >> > > > > > > >> > In  AbstractJobClusterExecutor, the job graph is
> >> generated in
> >> > > > the
> >> > > > > > > >> execute
> >> > > > > > > >> > function. We can still
> >> > > > > > > >> > use the deployMode in it. With deployMode = cluster, the
> >> > > execute
> >> > > > > > > >> function
> >> > > > > > > >> > only starts the cluster.
> >> > > > > > > >> >
> >> > > > > > > >> > When {Yarn/Kuberneates}PerJobClusterEntrypoint starts,
> >> It will
> >> > > > > start
> >> > > > > > > the
> >> > > > > > > >> > dispatch first, then we can use
> >> > > > > > > >> > a ClusterEnvironment similar to ContextEnvironment to
> >> submit
> >> > > the
> >> > > > > job
> >> > > > > > > >> with
> >> > > > > > > >> > jobName the local
> >> > > > > > > >> > dispatcher. For the details, we need more investigation.
> >> Let's
> >> > > > > wait
> >> > > > > > > >> > for @Aljoscha
> >> > > > > > > >> > Krettek <[hidden email]> @Till Rohrmann <
> >> > > > > [hidden email]
> >> > > > > > >'s
> >> > > > > > > >> > feedback after the holiday season.
> >> > > > > > > >> >
> >> > > > > > > >> > Thank you in advance. Merry Chrismas and Happy New
> >> Year!!!
> >> > > > > > > >> >
> >> > > > > > > >> >
> >> > > > > > > >> > Best Regards
> >> > > > > > > >> > Peter Huang
> >> > > > > > > >> >
> >> > > > > > > >> >
> >> > > > > > > >> >
> >> > > > > > > >> >
> >> > > > > > > >> >
> >> > > > > > > >> >
> >> > > > > > > >> >
> >> > > > > > > >> >
> >> > > > > > > >> > On Wed, Dec 25, 2019 at 1:08 AM Yang Wang <
> >> > > > [hidden email]>
> >> > > > > > > >> wrote:
> >> > > > > > > >> >
> >> > > > > > > >> >> Hi Peter,
> >> > > > > > > >> >>
> >> > > > > > > >> >> I think we need to reconsider tison's suggestion
> >> seriously.
> >> > > > After
> >> > > > > > > >> FLIP-73,
> >> > > > > > > >> >> the deployJobCluster has
> >> > > > > > > >> >> beenmoved into `JobClusterExecutor#execute`. It should
> >> not be
> >> > > > > > > perceived
> >> > > > > > > >> >> for `CliFrontend`. That
> >> > > > > > > >> >> means the user program will *ALWAYS* be executed on
> >> client
> >> > > > side.
> >> > > > > > This
> >> > > > > > > >> is
> >> > > > > > > >> >> the by design behavior.
> >> > > > > > > >> >> So, we could not just add `if(client mode) .. else
> >> if(cluster
> >> > > > > mode)
> >> > > > > > > >> ...`
> >> > > > > > > >> >> codes in `CliFrontend` to bypass
> >> > > > > > > >> >> the executor. We need to find a clean way to decouple
> >> > > executing
> >> > > > > > user
> >> > > > > > > >> >> program and deploying per-job
> >> > > > > > > >> >> cluster. Based on this, we could support to execute user
> >> > > > program
> >> > > > > on
> >> > > > > > > >> client
> >> > > > > > > >> >> or master side.
> >> > > > > > > >> >>
> >> > > > > > > >> >> Maybe Aljoscha and Jeff could give some good
> >> suggestions.
> >> > > > > > > >> >>
> >> > > > > > > >> >>
> >> > > > > > > >> >>
> >> > > > > > > >> >> Best,
> >> > > > > > > >> >> Yang
> >> > > > > > > >> >>
> >> > > > > > > >> >> Peter Huang <[hidden email]> 于2019年12月25日周三
> >> > > > > 上午4:03写道:
> >> > > > > > > >> >>
> >> > > > > > > >> >>> Hi Jingjing,
> >> > > > > > > >> >>>
> >> > > > > > > >> >>> The improvement proposed is a deployment option for
> >> CLI. For
> >> > > > SQL
> >> > > > > > > based
> >> > > > > > > >> >>> Flink application, It is more convenient to use the
> >> existing
> >> > > > > model
> >> > > > > > > in
> >> > > > > > > >> >>> SqlClient in which
> >> > > > > > > >> >>> the job graph is generated within SqlClient. After
> >> adding
> >> > > the
> >> > > > > > > delayed
> >> > > > > > > >> job
> >> > > > > > > >> >>> graph generation, I think there is no change is needed
> >> for
> >> > > > your
> >> > > > > > > side.
> >> > > > > > > >> >>>
> >> > > > > > > >> >>>
> >> > > > > > > >> >>> Best Regards
> >> > > > > > > >> >>> Peter Huang
> >> > > > > > > >> >>>
> >> > > > > > > >> >>>
> >> > > > > > > >> >>> On Wed, Dec 18, 2019 at 6:01 AM jingjing bai <
> >> > > > > > > >> [hidden email]>
> >> > > > > > > >> >>> wrote:
> >> > > > > > > >> >>>
> >> > > > > > > >> >>>> hi peter:
> >> > > > > > > >> >>>>    we had extension SqlClent to support sql job
> >> submit in
> >> > > web
> >> > > > > > base
> >> > > > > > > on
> >> > > > > > > >> >>>> flink 1.9.   we support submit to yarn on per job
> >> mode too.
> >> > > > > > > >> >>>>    in this case, the job graph generated  on client
> >> side
> >> > > .  I
> >> > > > > > think
> >> > > > > > > >> >>> this
> >> > > > > > > >> >>>> discuss Mainly to improve api programme.  but in my
> >> case ,
> >> > > > > there
> >> > > > > > is
> >> > > > > > > >> no
> >> > > > > > > >> >>>> jar to upload but only a sql string .
> >> > > > > > > >> >>>>    do u had more suggestion to improve for sql mode
> >> or it
> >> > > is
> >> > > > > > only a
> >> > > > > > > >> >>>> switch for api programme?
> >> > > > > > > >> >>>>
> >> > > > > > > >> >>>>
> >> > > > > > > >> >>>> best
> >> > > > > > > >> >>>> bai jj
> >> > > > > > > >> >>>>
> >> > > > > > > >> >>>>
> >> > > > > > > >> >>>> Yang Wang <[hidden email]> 于2019年12月18日周三
> >> 下午7:21写道:
> >> > > > > > > >> >>>>
> >> > > > > > > >> >>>>> I just want to revive this discussion.
> >> > > > > > > >> >>>>>
> >> > > > > > > >> >>>>> Recently, i am thinking about how to natively run
> >> flink
> >> > > > > per-job
> >> > > > > > > >> >>> cluster on
> >> > > > > > > >> >>>>> Kubernetes.
> >> > > > > > > >> >>>>> The per-job mode on Kubernetes is very different
> >> from on
> >> > > > Yarn.
> >> > > > > > And
> >> > > > > > > >> we
> >> > > > > > > >> >>> will
> >> > > > > > > >> >>>>> have
> >> > > > > > > >> >>>>> the same deployment requirements to the client and
> >> entry
> >> > > > > point.
> >> > > > > > > >> >>>>>
> >> > > > > > > >> >>>>> 1. Flink client not always need a local jar to start
> >> a
> >> > > Flink
> >> > > > > > > per-job
> >> > > > > > > >> >>>>> cluster. We could
> >> > > > > > > >> >>>>> support multiple schemas. For example,
> >> > > > file:///path/of/my.jar
> >> > > > > > > means
> >> > > > > > > >> a
> >> > > > > > > >> >>> jar
> >> > > > > > > >> >>>>> located
> >> > > > > > > >> >>>>> at client side,
> >> hdfs://myhdfs/user/myname/flink/my.jar
> >> > > > means a
> >> > > > > > jar
> >> > > > > > > >> >>> located
> >> > > > > > > >> >>>>> at
> >> > > > > > > >> >>>>> remote hdfs, local:///path/in/image/my.jar means a
> >> jar
> >> > > > located
> >> > > > > > at
> >> > > > > > > >> >>>>> jobmanager side.
> >> > > > > > > >> >>>>>
> >> > > > > > > >> >>>>> 2. Support running user program on master side. This
> >> also
> >> > > > > means
> >> > > > > > > the
> >> > > > > > > >> >>> entry
> >> > > > > > > >> >>>>> point
> >> > > > > > > >> >>>>> will generate the job graph on master side. We could
> >> use
> >> > > the
> >> > > > > > > >> >>>>> ClasspathJobGraphRetriever
> >> > > > > > > >> >>>>> or start a local Flink client to achieve this
> >> purpose.
> >> > > > > > > >> >>>>>
> >> > > > > > > >> >>>>>
> >> > > > > > > >> >>>>> cc tison, Aljoscha & Kostas Do you think this is the
> >> right
> >> > > > > > > >> direction we
> >> > > > > > > >> >>>>> need to work?
> >> > > > > > > >> >>>>>
> >> > > > > > > >> >>>>> tison <[hidden email]> 于2019年12月12日周四
> >> 下午4:48写道:
> >> > > > > > > >> >>>>>
> >> > > > > > > >> >>>>>> A quick idea is that we separate the deployment
> >> from user
> >> > > > > > program
> >> > > > > > > >> >>> that
> >> > > > > > > >> >>>>> it
> >> > > > > > > >> >>>>>> has always been done
> >> > > > > > > >> >>>>>> outside the program. On user program executed there
> >> is
> >> > > > > always a
> >> > > > > > > >> >>>>>> ClusterClient that communicates with
> >> > > > > > > >> >>>>>> an existing cluster, remote or local. It will be
> >> another
> >> > > > > thread
> >> > > > > > > so
> >> > > > > > > >> >>> just
> >> > > > > > > >> >>>>> for
> >> > > > > > > >> >>>>>> your information.
> >> > > > > > > >> >>>>>>
> >> > > > > > > >> >>>>>> Best,
> >> > > > > > > >> >>>>>> tison.
> >> > > > > > > >> >>>>>>
> >> > > > > > > >> >>>>>>
> >> > > > > > > >> >>>>>> tison <[hidden email]> 于2019年12月12日周四
> >> 下午4:40写道:
> >> > > > > > > >> >>>>>>
> >> > > > > > > >> >>>>>>> Hi Peter,
> >> > > > > > > >> >>>>>>>
> >> > > > > > > >> >>>>>>> Another concern I realized recently is that with
> >> current
> >> > > > > > > Executors
> >> > > > > > > >> >>>>>>> abstraction(FLIP-73)
> >> > > > > > > >> >>>>>>> I'm afraid that user program is designed to ALWAYS
> >> run
> >> > > on
> >> > > > > the
> >> > > > > > > >> >>> client
> >> > > > > > > >> >>>>>> side.
> >> > > > > > > >> >>>>>>> Specifically,
> >> > > > > > > >> >>>>>>> we deploy the job in executor when env.execute
> >> called.
> >> > > > This
> >> > > > > > > >> >>>>> abstraction
> >> > > > > > > >> >>>>>>> possibly prevents
> >> > > > > > > >> >>>>>>> Flink runs user program on the cluster side.
> >> > > > > > > >> >>>>>>>
> >> > > > > > > >> >>>>>>> For your proposal, in this case we already
> >> compiled the
> >> > > > > > program
> >> > > > > > > >> and
> >> > > > > > > >> >>>>> run
> >> > > > > > > >> >>>>>> on
> >> > > > > > > >> >>>>>>> the client side,
> >> > > > > > > >> >>>>>>> even we deploy a cluster and retrieve job graph
> >> from
> >> > > > program
> >> > > > > > > >> >>>>> metadata, it
> >> > > > > > > >> >>>>>>> doesn't make
> >> > > > > > > >> >>>>>>> many sense.
> >> > > > > > > >> >>>>>>>
> >> > > > > > > >> >>>>>>> cc Aljoscha & Kostas what do you think about this
> >> > > > > constraint?
> >> > > > > > > >> >>>>>>>
> >> > > > > > > >> >>>>>>> Best,
> >> > > > > > > >> >>>>>>> tison.
> >> > > > > > > >> >>>>>>>
> >> > > > > > > >> >>>>>>>
> >> > > > > > > >> >>>>>>> Peter Huang <[hidden email]>
> >> 于2019年12月10日周二
> >> > > > > > > >> 下午12:45写道:
> >> > > > > > > >> >>>>>>>
> >> > > > > > > >> >>>>>>>> Hi Tison,
> >> > > > > > > >> >>>>>>>>
> >> > > > > > > >> >>>>>>>> Yes, you are right. I think I made the wrong
> >> argument
> >> > > in
> >> > > > > the
> >> > > > > > > doc.
> >> > > > > > > >> >>>>>>>> Basically, the packaging jar problem is only for
> >> > > platform
> >> > > > > > > users.
> >> > > > > > > >> >>> In
> >> > > > > > > >> >>>>> our
> >> > > > > > > >> >>>>>>>> internal deploy service,
> >> > > > > > > >> >>>>>>>> we further optimized the deployment latency by
> >> letting
> >> > > > > users
> >> > > > > > to
> >> > > > > > > >> >>>>>> packaging
> >> > > > > > > >> >>>>>>>> flink-runtime together with the uber jar, so that
> >> we
> >> > > > don't
> >> > > > > > need
> >> > > > > > > >> to
> >> > > > > > > >> >>>>>>>> consider
> >> > > > > > > >> >>>>>>>> multiple flink version
> >> > > > > > > >> >>>>>>>> support for now. In the session client mode, as
> >> Flink
> >> > > > libs
> >> > > > > > will
> >> > > > > > > >> be
> >> > > > > > > >> >>>>>> shipped
> >> > > > > > > >> >>>>>>>> anyway as local resources of yarn. Users actually
> >> don't
> >> > > > > need
> >> > > > > > to
> >> > > > > > > >> >>>>> package
> >> > > > > > > >> >>>>>>>> those libs into job jar.
> >> > > > > > > >> >>>>>>>>
> >> > > > > > > >> >>>>>>>>
> >> > > > > > > >> >>>>>>>>
> >> > > > > > > >> >>>>>>>> Best Regards
> >> > > > > > > >> >>>>>>>> Peter Huang
> >> > > > > > > >> >>>>>>>>
> >> > > > > > > >> >>>>>>>> On Mon, Dec 9, 2019 at 8:35 PM tison <
> >> > > > [hidden email]
> >> > > > > >
> >> > > > > > > >> >>> wrote:
> >> > > > > > > >> >>>>>>>>
> >> > > > > > > >> >>>>>>>>>> 3. What do you mean about the package? Do users
> >> need
> >> > > to
> >> > > > > > > >> >>> compile
> >> > > > > > > >> >>>>>> their
> >> > > > > > > >> >>>>>>>>> jars
> >> > > > > > > >> >>>>>>>>> inlcuding flink-clients, flink-optimizer,
> >> flink-table
> >> > > > > codes?
> >> > > > > > > >> >>>>>>>>>
> >> > > > > > > >> >>>>>>>>> The answer should be no because they exist in
> >> system
> >> > > > > > > classpath.
> >> > > > > > > >> >>>>>>>>>
> >> > > > > > > >> >>>>>>>>> Best,
> >> > > > > > > >> >>>>>>>>> tison.
> >> > > > > > > >> >>>>>>>>>
> >> > > > > > > >> >>>>>>>>>
> >> > > > > > > >> >>>>>>>>> Yang Wang <[hidden email]> 于2019年12月10日周二
> >> > > > > 下午12:18写道:
> >> > > > > > > >> >>>>>>>>>
> >> > > > > > > >> >>>>>>>>>> Hi Peter,
> >> > > > > > > >> >>>>>>>>>>
> >> > > > > > > >> >>>>>>>>>> Thanks a lot for starting this discussion. I
> >> think
> >> > > this
> >> > > > > is
> >> > > > > > a
> >> > > > > > > >> >>> very
> >> > > > > > > >> >>>>>>>> useful
> >> > > > > > > >> >>>>>>>>>> feature.
> >> > > > > > > >> >>>>>>>>>>
> >> > > > > > > >> >>>>>>>>>> Not only for Yarn, i am focused on flink on
> >> > > Kubernetes
> >> > > > > > > >> >>>>> integration
> >> > > > > > > >> >>>>>> and
> >> > > > > > > >> >>>>>>>>> come
> >> > > > > > > >> >>>>>>>>>> across the same
> >> > > > > > > >> >>>>>>>>>> problem. I do not want the job graph generated
> >> on
> >> > > > client
> >> > > > > > > side.
> >> > > > > > > >> >>>>>>>> Instead,
> >> > > > > > > >> >>>>>>>>> the
> >> > > > > > > >> >>>>>>>>>> user jars are built in
> >> > > > > > > >> >>>>>>>>>> a user-defined image. When the job manager
> >> launched,
> >> > > we
> >> > > > > > just
> >> > > > > > > >> >>>>> need to
> >> > > > > > > >> >>>>>>>>>> generate the job graph
> >> > > > > > > >> >>>>>>>>>> based on local user jars.
> >> > > > > > > >> >>>>>>>>>>
> >> > > > > > > >> >>>>>>>>>> I have some small suggestion about this.
> >> > > > > > > >> >>>>>>>>>>
> >> > > > > > > >> >>>>>>>>>> 1. `ProgramJobGraphRetriever` is very similar to
> >> > > > > > > >> >>>>>>>>>> `ClasspathJobGraphRetriever`, the differences
> >> > > > > > > >> >>>>>>>>>> are the former needs `ProgramMetadata` and the
> >> latter
> >> > > > > needs
> >> > > > > > > >> >>> some
> >> > > > > > > >> >>>>>>>>> arguments.
> >> > > > > > > >> >>>>>>>>>> Is it possible to
> >> > > > > > > >> >>>>>>>>>> have an unified `JobGraphRetriever` to support
> >> both?
> >> > > > > > > >> >>>>>>>>>> 2. Is it possible to not use a local user jar to
> >> > > start
> >> > > > a
> >> > > > > > > >> >>> per-job
> >> > > > > > > >> >>>>>>>> cluster?
> >> > > > > > > >> >>>>>>>>>> In your case, the user jars has
> >> > > > > > > >> >>>>>>>>>> existed on hdfs already and we do need to
> >> download
> >> > > the
> >> > > > > jars
> >> > > > > > > to
> >> > > > > > > >> >>>>>>>> deployer
> >> > > > > > > >> >>>>>>>>>> service. Currently, we
> >> > > > > > > >> >>>>>>>>>> always need a local user jar to start a flink
> >> > > cluster.
> >> > > > It
> >> > > > > > is
> >> > > > > > > >> >>> be
> >> > > > > > > >> >>>>>> great
> >> > > > > > > >> >>>>>>>> if
> >> > > > > > > >> >>>>>>>>> we
> >> > > > > > > >> >>>>>>>>>> could support remote user jars.
> >> > > > > > > >> >>>>>>>>>>>> In the implementation, we assume users package
> >> > > > > > > >> >>> flink-clients,
> >> > > > > > > >> >>>>>>>>>> flink-optimizer, flink-table together within
> >> the job
> >> > > > jar.
> >> > > > > > > >> >>>>> Otherwise,
> >> > > > > > > >> >>>>>>>> the
> >> > > > > > > >> >>>>>>>>>> job graph generation within
> >> JobClusterEntryPoint will
> >> > > > > fail.
> >> > > > > > > >> >>>>>>>>>> 3. What do you mean about the package? Do users
> >> need
> >> > > to
> >> > > > > > > >> >>> compile
> >> > > > > > > >> >>>>>> their
> >> > > > > > > >> >>>>>>>>> jars
> >> > > > > > > >> >>>>>>>>>> inlcuding flink-clients, flink-optimizer,
> >> flink-table
> >> > > > > > codes?
> >> > > > > > > >> >>>>>>>>>>
> >> > > > > > > >> >>>>>>>>>>
> >> > > > > > > >> >>>>>>>>>>
> >> > > > > > > >> >>>>>>>>>> Best,
> >> > > > > > > >> >>>>>>>>>> Yang
> >> > > > > > > >> >>>>>>>>>>
> >> > > > > > > >> >>>>>>>>>> Peter Huang <[hidden email]>
> >> > > > 于2019年12月10日周二
> >> > > > > > > >> >>>>> 上午2:37写道:
> >> > > > > > > >> >>>>>>>>>>
> >> > > > > > > >> >>>>>>>>>>> Dear All,
> >> > > > > > > >> >>>>>>>>>>>
> >> > > > > > > >> >>>>>>>>>>> Recently, the Flink community starts to
> >> improve the
> >> > > > yarn
> >> > > > > > > >> >>>>> cluster
> >> > > > > > > >> >>>>>>>>>> descriptor
> >> > > > > > > >> >>>>>>>>>>> to make job jar and config files configurable
> >> from
> >> > > > CLI.
> >> > > > > It
> >> > > > > > > >> >>>>>> improves
> >> > > > > > > >> >>>>>>>> the
> >> > > > > > > >> >>>>>>>>>>> flexibility of  Flink deployment Yarn Per Job
> >> Mode.
> >> > > > For
> >> > > > > > > >> >>>>> platform
> >> > > > > > > >> >>>>>>>> users
> >> > > > > > > >> >>>>>>>>>> who
> >> > > > > > > >> >>>>>>>>>>> manage tens of hundreds of streaming pipelines
> >> for
> >> > > the
> >> > > > > > whole
> >> > > > > > > >> >>>>> org
> >> > > > > > > >> >>>>>> or
> >> > > > > > > >> >>>>>>>>>>> company, we found the job graph generation in
> >> > > > > client-side
> >> > > > > > is
> >> > > > > > > >> >>>>>> another
> >> > > > > > > >> >>>>>>>>>>> pinpoint. Thus, we want to propose a
> >> configurable
> >> > > > > feature
> >> > > > > > > >> >>> for
> >> > > > > > > >> >>>>>>>>>>> FlinkYarnSessionCli. The feature can allow
> >> users to
> >> > > > > choose
> >> > > > > > > >> >>> the
> >> > > > > > > >> >>>>> job
> >> > > > > > > >> >>>>>>>>> graph
> >> > > > > > > >> >>>>>>>>>>> generation in Flink ClusterEntryPoint so that
> >> the
> >> > > job
> >> > > > > jar
> >> > > > > > > >> >>>>> doesn't
> >> > > > > > > >> >>>>>>>> need
> >> > > > > > > >> >>>>>>>>> to
> >> > > > > > > >> >>>>>>>>>>> be locally for the job graph generation. The
> >> > > proposal
> >> > > > is
> >> > > > > > > >> >>>>> organized
> >> > > > > > > >> >>>>>>>> as a
> >> > > > > > > >> >>>>>>>>>>> FLIP
> >> > > > > > > >> >>>>>>>>>>>
> >> > > > > > > >> >>>>>>>>>>>
> >> > > > > > > >> >>>>>>>>>>
> >> > > > > > > >> >>>>>>>>>
> >> > > > > > > >> >>>>>>>>
> >> > > > > > > >> >>>>>>
> >> > > > > > > >> >>>>>
> >> > > > > > > >> >>>
> >> > > > > > > >>
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-85+Delayed+JobGraph+Generation
> >> > > > > > > >> >>>>>>>>>>> .
> >> > > > > > > >> >>>>>>>>>>>
> >> > > > > > > >> >>>>>>>>>>> Any questions and suggestions are welcomed.
> >> Thank
> >> > > you
> >> > > > in
> >> > > > > > > >> >>>>> advance.
> >> > > > > > > >> >>>>>>>>>>>
> >> > > > > > > >> >>>>>>>>>>>
> >> > > > > > > >> >>>>>>>>>>> Best Regards
> >> > > > > > > >> >>>>>>>>>>> Peter Huang
> >> > > > > > > >> >>>>>>>>>>>
> >> > > > > > > >> >>>>>>>>>>
> >> > > > > > > >> >>>>>>>>>
> >> > > > > > > >> >>>>>>>>
> >> > > > > > > >> >>>>>>>
> >> > > > > > > >> >>>>>>
> >> > > > > > > >> >>>>>
> >> > > > > > > >> >>>>
> >> > > > > > > >> >>>
> >> > > > > > > >> >>
> >> > > > > > > >>
> >> > > > > > > >>
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >>
> >

Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-85: Delayed Job Graph Generation

Yang Wang
Hi Kostas,

Thanks a lot for your conclusion and updating the FLIP-85 WIKI. Currently,
i have no more
questions about motivation, approach, fault tolerance and the first phase
implementation.

I think the new title "Flink Application Mode" makes a lot senses to me.
Especially for the
containerized environment, the cluster deploy option will be very useful.

Just one concern, how do we introduce this new application mode to our
users?
Each user program(i.e. `main()`) is an application. Currently, we intend to
only support one
`execute()`. So what's the difference between per-job and application mode?

For per-job, user `main()` is always executed on client side. And For
application mode, user
`main()` could be executed on client or master side(configured via cli
option).
Right? We need to have a clear concept. Otherwise, the users will be more
and more confusing.


Best,
Yang

Kostas Kloudas <[hidden email]> 于2020年3月2日周一 下午5:58写道:

> Hi all,
>
> I update
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-85+Flink+Application+Mode
> based on the discussion we had here:
>
>
> https://docs.google.com/document/d/1ji72s3FD9DYUyGuKnJoO4ApzV-nSsZa0-bceGXW7Ocw/edit#
>
> Please let me know what you think and please keep the discussion in the ML
> :)
>
> Thanks for starting the discussion and I hope that soon we will be
> able to vote on the FLIP.
>
> Cheers,
> Kostas
>
> On Thu, Jan 16, 2020 at 3:40 AM Yang Wang <[hidden email]> wrote:
> >
> > Hi all,
> >
> > Thanks a lot for the feedback from @Kostas Kloudas. Your all concerns are
> > on point. The FLIP-85 is mainly
> > focused on supporting cluster mode for per-job. Since it is more urgent
> and
> > have much more use
> > cases both in Yarn and Kubernetes deployment. For session cluster, we
> could
> > have more discussion
> > in a new thread later.
> >
> > #1, How to download the user jars and dependencies for per-job in cluster
> > mode?
> > For Yarn, we could register the user jars and dependencies as
> > LocalResource. They will be distributed
> > by Yarn. And once the JobManager and TaskManager launched, the jars are
> > already exists.
> > For Standalone per-job and K8s, we expect that the user jars
> > and dependencies are built into the image.
> > Or the InitContainer could be used for downloading. It is natively
> > distributed and we will not have bottleneck.
> >
> > #2, Job graph recovery
> > We could have an optimization to store job graph on the DFS. However, i
> > suggest building a new jobgraph
> > from the configuration is the default option. Since we will not always
> have
> > a DFS store when deploying a
> > Flink per-job cluster. Of course, we assume that using the same
> > configuration(e.g. job_id, user_jar, main_class,
> > main_args, parallelism, savepoint_settings, etc.) will get a same job
> > graph. I think the standalone per-job
> > already has the similar behavior.
> >
> > #3, What happens with jobs that have multiple execute calls?
> > Currently, it is really a problem. Even we use a local client on Flink
> > master side, it will have different behavior with
> > client mode. For client mode, if we execute multiple times, then we will
> > deploy multiple Flink clusters for each execute.
> > I am not pretty sure whether it is reasonable. However, i still think
> using
> > the local client is a good choice. We could
> > continue the discussion in a new thread. @Zili Chen <
> [hidden email]> Do
> > you want to drive this?
> >
> >
> >
> > Best,
> > Yang
> >
> > Peter Huang <[hidden email]> 于2020年1月16日周四 上午1:55写道:
> >
> > > Hi Kostas,
> > >
> > > Thanks for this feedback. I can't agree more about the opinion. The
> > > cluster mode should be added
> > > first in per job cluster.
> > >
> > > 1) For job cluster implementation
> > > 1. Job graph recovery from configuration or store as static job graph
> as
> > > session cluster. I think the static one will be better for less
> recovery
> > > time.
> > > Let me update the doc for details.
> > >
> > > 2. For job execute multiple times, I think @Zili Chen
> > > <[hidden email]> has proposed the local client solution that can
> > > the run program actually in the cluster entry point. We can put the
> > > implementation in the second stage,
> > > or even a new FLIP for further discussion.
> > >
> > > 2) For session cluster implementation
> > > We can disable the cluster mode for the session cluster in the first
> > > stage. I agree the jar downloading will be a painful thing.
> > > We can consider about PoC and performance evaluation first. If the end
> to
> > > end experience is good enough, then we can consider
> > > proceeding with the solution.
> > >
> > > Looking forward to more opinions from @Yang Wang <
> [hidden email]> @Zili
> > > Chen <[hidden email]> @Dian Fu <[hidden email]>.
> > >
> > >
> > > Best Regards
> > > Peter Huang
> > >
> > > On Wed, Jan 15, 2020 at 7:50 AM Kostas Kloudas <[hidden email]>
> wrote:
> > >
> > >> Hi all,
> > >>
> > >> I am writing here as the discussion on the Google Doc seems to be a
> > >> bit difficult to follow.
> > >>
> > >> I think that in order to be able to make progress, it would be helpful
> > >> to focus on per-job mode for now.
> > >> The reason is that:
> > >>  1) making the (unique) JobSubmitHandler responsible for creating the
> > >> jobgraphs,
> > >>   which includes downloading dependencies, is not an optimal solution
> > >>  2) even if we put the responsibility on the JobMaster, currently each
> > >> job has its own
> > >>   JobMaster but they all run on the same process, so we have again a
> > >> single entity.
> > >>
> > >> Of course after this is done, and if we feel comfortable with the
> > >> solution, then we can go to the session mode.
> > >>
> > >> A second comment has to do with fault-tolerance in the per-job,
> > >> cluster-deploy mode.
> > >> In the document, it is suggested that upon recovery, the JobMaster of
> > >> each job re-creates the JobGraph.
> > >> I am just wondering if it is better to create and store the jobGraph
> > >> upon submission and only fetch it
> > >> upon recovery so that we have a static jobGraph.
> > >>
> > >> Finally, I have a question which is what happens with jobs that have
> > >> multiple execute calls?
> > >> The semantics seem to change compared to the current behaviour, right?
> > >>
> > >> Cheers,
> > >> Kostas
> > >>
> > >> On Wed, Jan 8, 2020 at 8:05 PM tison <[hidden email]> wrote:
> > >> >
> > >> > not always, Yang Wang is also not yet a committer but he can join
> the
> > >> > channel. I cannot find the id by clicking “Add new member in
> channel” so
> > >> > come to you and ask for try out the link. Possibly I will find other
> > >> ways
> > >> > but the original purpose is that the slack channel is a public area
> we
> > >> > discuss about developing...
> > >> > Best,
> > >> > tison.
> > >> >
> > >> >
> > >> > Peter Huang <[hidden email]> 于2020年1月9日周四 上午2:44写道:
> > >> >
> > >> > > Hi Tison,
> > >> > >
> > >> > > I am not the committer of Flink yet. I think I can't join it also.
> > >> > >
> > >> > >
> > >> > > Best Regards
> > >> > > Peter Huang
> > >> > >
> > >> > > On Wed, Jan 8, 2020 at 9:39 AM tison <[hidden email]>
> wrote:
> > >> > >
> > >> > > > Hi Peter,
> > >> > > >
> > >> > > > Could you try out this link?
> > >> > > https://the-asf.slack.com/messages/CNA3ADZPH
> > >> > > >
> > >> > > > Best,
> > >> > > > tison.
> > >> > > >
> > >> > > >
> > >> > > > Peter Huang <[hidden email]> 于2020年1月9日周四 上午1:22写道:
> > >> > > >
> > >> > > > > Hi Tison,
> > >> > > > >
> > >> > > > > I can't join the group with shared link. Would you please add
> me
> > >> into
> > >> > > the
> > >> > > > > group? My slack account is huangzhenqiu0825.
> > >> > > > > Thank you in advance.
> > >> > > > >
> > >> > > > >
> > >> > > > > Best Regards
> > >> > > > > Peter Huang
> > >> > > > >
> > >> > > > > On Wed, Jan 8, 2020 at 12:02 AM tison <[hidden email]>
> > >> wrote:
> > >> > > > >
> > >> > > > > > Hi Peter,
> > >> > > > > >
> > >> > > > > > As described above, this effort should get attention from
> people
> > >> > > > > developing
> > >> > > > > > FLIP-73 a.k.a. Executor abstractions. I recommend you to
> join
> > >> the
> > >> > > > public
> > >> > > > > > slack channel[1] for Flink Client API Enhancement and you
> can
> > >> try to
> > >> > > > > share
> > >> > > > > > you detailed thoughts there. It possibly gets more concrete
> > >> > > attentions.
> > >> > > > > >
> > >> > > > > > Best,
> > >> > > > > > tison.
> > >> > > > > >
> > >> > > > > > [1]
> > >> > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >>
> https://slack.com/share/IS21SJ75H/Rk8HhUly9FuEHb7oGwBZ33uL/enQtODg2MDYwNjE5MTg3LTA2MjIzNDc1M2ZjZDVlMjdlZjk1M2RkYmJhNjAwMTk2ZDZkODQ4NmY5YmI4OGRhNWJkYTViMTM1NzlmMzc4OWM
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > Peter Huang <[hidden email]> 于2020年1月7日周二
> 上午5:09写道:
> > >> > > > > >
> > >> > > > > > > Dear All,
> > >> > > > > > >
> > >> > > > > > > Happy new year! According to existing feedback from the
> > >> community,
> > >> > > we
> > >> > > > > > > revised the doc with the consideration of session cluster
> > >> support,
> > >> > > > and
> > >> > > > > > > concrete interface changes needed and execution plan.
> Please
> > >> take
> > >> > > one
> > >> > > > > > more
> > >> > > > > > > round of review at your most convenient time.
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >>
> https://docs.google.com/document/d/1aAwVjdZByA-0CHbgv16Me-vjaaDMCfhX7TzVVTuifYM/edit#
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > > Best Regards
> > >> > > > > > > Peter Huang
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > > On Thu, Jan 2, 2020 at 11:29 AM Peter Huang <
> > >> > > > > [hidden email]>
> > >> > > > > > > wrote:
> > >> > > > > > >
> > >> > > > > > > > Hi Dian,
> > >> > > > > > > > Thanks for giving us valuable feedbacks.
> > >> > > > > > > >
> > >> > > > > > > > 1) It's better to have a whole design for this feature
> > >> > > > > > > > For the suggestion of enabling the cluster mode also
> session
> > >> > > > > cluster, I
> > >> > > > > > > > think Flink already supported it. WebSubmissionExtension
> > >> already
> > >> > > > > allows
> > >> > > > > > > > users to start a job with the specified jar by using
> web UI.
> > >> > > > > > > > But we need to enable the feature from CLI for both
> local
> > >> jar,
> > >> > > > remote
> > >> > > > > > > jar.
> > >> > > > > > > > I will align with Yang Wang first about the details and
> > >> update
> > >> > > the
> > >> > > > > > design
> > >> > > > > > > > doc.
> > >> > > > > > > >
> > >> > > > > > > > 2) It's better to consider the convenience for users,
> such
> > >> as
> > >> > > > > debugging
> > >> > > > > > > >
> > >> > > > > > > > I am wondering whether we can store the exception in
> > >> jobgragh
> > >> > > > > > > > generation in application master. As no streaming graph
> can
> > >> be
> > >> > > > > > scheduled
> > >> > > > > > > in
> > >> > > > > > > > this case, there will be no more TM will be requested
> from
> > >> > > FlinkRM.
> > >> > > > > > > > If the AM is still running, users can still query it
> from
> > >> CLI. As
> > >> > > > it
> > >> > > > > > > > requires more change, we can get some feedback from <
> > >> > > > > > [hidden email]
> > >> > > > > > > >
> > >> > > > > > > > and @[hidden email] <[hidden email]>.
> > >> > > > > > > >
> > >> > > > > > > > 3) It's better to consider the impact to the stability
> of
> > >> the
> > >> > > > cluster
> > >> > > > > > > >
> > >> > > > > > > > I agree with Yang Wang's opinion.
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > Best Regards
> > >> > > > > > > > Peter Huang
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > On Sun, Dec 29, 2019 at 9:44 PM Dian Fu <
> > >> [hidden email]>
> > >> > > > > wrote:
> > >> > > > > > > >
> > >> > > > > > > >> Hi all,
> > >> > > > > > > >>
> > >> > > > > > > >> Sorry to jump into this discussion. Thanks everyone
> for the
> > >> > > > > > discussion.
> > >> > > > > > > >> I'm very interested in this topic although I'm not an
> > >> expert in
> > >> > > > this
> > >> > > > > > > part.
> > >> > > > > > > >> So I'm glad to share my thoughts as following:
> > >> > > > > > > >>
> > >> > > > > > > >> 1) It's better to have a whole design for this feature
> > >> > > > > > > >> As we know, there are two deployment modes: per-job
> mode
> > >> and
> > >> > > > session
> > >> > > > > > > >> mode. I'm wondering which mode really needs this
> feature.
> > >> As the
> > >> > > > > > design
> > >> > > > > > > doc
> > >> > > > > > > >> mentioned, per-job mode is more used for streaming
> jobs and
> > >> > > > session
> > >> > > > > > > mode is
> > >> > > > > > > >> usually used for batch jobs(Of course, the job types
> and
> > >> the
> > >> > > > > > deployment
> > >> > > > > > > >> modes are orthogonal). Usually streaming job is only
> > >> needed to
> > >> > > be
> > >> > > > > > > submitted
> > >> > > > > > > >> once and it will run for days or weeks, while batch
> jobs
> > >> will be
> > >> > > > > > > submitted
> > >> > > > > > > >> more frequently compared with streaming jobs. This
> means
> > >> that
> > >> > > > maybe
> > >> > > > > > > session
> > >> > > > > > > >> mode also needs this feature. However, if we support
> this
> > >> > > feature
> > >> > > > in
> > >> > > > > > > >> session mode, the application master will become the
> new
> > >> > > > centralized
> > >> > > > > > > >> service(which should be solved). So in this case, it's
> > >> better to
> > >> > > > > have
> > >> > > > > > a
> > >> > > > > > > >> complete design for both per-job mode and session mode.
> > >> > > > Furthermore,
> > >> > > > > > > even
> > >> > > > > > > >> if we can do it phase by phase, we need to have a whole
> > >> picture
> > >> > > of
> > >> > > > > how
> > >> > > > > > > it
> > >> > > > > > > >> works in both per-job mode and session mode.
> > >> > > > > > > >>
> > >> > > > > > > >> 2) It's better to consider the convenience for users,
> such
> > >> as
> > >> > > > > > debugging
> > >> > > > > > > >> After we finish this feature, the job graph will be
> > >> compiled in
> > >> > > > the
> > >> > > > > > > >> application master, which means that users cannot
> easily
> > >> get the
> > >> > > > > > > exception
> > >> > > > > > > >> message synchorousely in the job client if there are
> > >> problems
> > >> > > > during
> > >> > > > > > the
> > >> > > > > > > >> job graph compiling (especially for platform users),
> such
> > >> as the
> > >> > > > > > > resource
> > >> > > > > > > >> path is incorrect, the user program itself has some
> > >> problems,
> > >> > > etc.
> > >> > > > > > What
> > >> > > > > > > I'm
> > >> > > > > > > >> thinking is that maybe we should throw the exceptions
> as
> > >> early
> > >> > > as
> > >> > > > > > > possible
> > >> > > > > > > >> (during job submission stage).
> > >> > > > > > > >>
> > >> > > > > > > >> 3) It's better to consider the impact to the stability
> of
> > >> the
> > >> > > > > cluster
> > >> > > > > > > >> If we perform the compiling in the application master,
> we
> > >> should
> > >> > > > > > > consider
> > >> > > > > > > >> the impact of the compiling errors. Although YARN could
> > >> resume
> > >> > > the
> > >> > > > > > > >> application master in case of failures, but in some
> case
> > >> the
> > >> > > > > compiling
> > >> > > > > > > >> failure may be a waste of cluster resource and may
> impact
> > >> the
> > >> > > > > > stability
> > >> > > > > > > the
> > >> > > > > > > >> cluster and the other jobs in the cluster, such as the
> > >> resource
> > >> > > > path
> > >> > > > > > is
> > >> > > > > > > >> incorrect, the user program itself has some problems(in
> > >> this
> > >> > > case,
> > >> > > > > job
> > >> > > > > > > >> failover cannot solve this kind of problems) etc. In
> the
> > >> current
> > >> > > > > > > >> implemention, the compiling errors are handled in the
> > >> client
> > >> > > side
> > >> > > > > and
> > >> > > > > > > there
> > >> > > > > > > >> is no impact to the cluster at all.
> > >> > > > > > > >>
> > >> > > > > > > >> Regarding to 1), it's clearly pointed in the design doc
> > >> that
> > >> > > only
> > >> > > > > > > per-job
> > >> > > > > > > >> mode will be supported. However, I think it's better to
> > >> also
> > >> > > > > consider
> > >> > > > > > > the
> > >> > > > > > > >> session mode in the design doc.
> > >> > > > > > > >> Regarding to 2) and 3), I have not seen related
> sections
> > >> in the
> > >> > > > > design
> > >> > > > > > > >> doc. It will be good if we can cover them in the design
> > >> doc.
> > >> > > > > > > >>
> > >> > > > > > > >> Feel free to correct me If there is anything I
> > >> misunderstand.
> > >> > > > > > > >>
> > >> > > > > > > >> Regards,
> > >> > > > > > > >> Dian
> > >> > > > > > > >>
> > >> > > > > > > >>
> > >> > > > > > > >> > 在 2019年12月27日,上午3:13,Peter Huang <
> > >> [hidden email]>
> > >> > > > 写道:
> > >> > > > > > > >> >
> > >> > > > > > > >> > Hi Yang,
> > >> > > > > > > >> >
> > >> > > > > > > >> > I can't agree more. The effort definitely needs to
> align
> > >> with
> > >> > > > the
> > >> > > > > > > final
> > >> > > > > > > >> > goal of FLIP-73.
> > >> > > > > > > >> > I am thinking about whether we can achieve the goal
> with
> > >> two
> > >> > > > > phases.
> > >> > > > > > > >> >
> > >> > > > > > > >> > 1) Phase I
> > >> > > > > > > >> > As the CLiFrontend will not be depreciated soon. We
> can
> > >> still
> > >> > > > use
> > >> > > > > > the
> > >> > > > > > > >> > deployMode flag there,
> > >> > > > > > > >> > pass the program info through Flink configuration,
> use
> > >> the
> > >> > > > > > > >> > ClassPathJobGraphRetriever
> > >> > > > > > > >> > to generate the job graph in ClusterEntrypoints of
> yarn
> > >> and
> > >> > > > > > > Kubernetes.
> > >> > > > > > > >> >
> > >> > > > > > > >> > 2) Phase II
> > >> > > > > > > >> > In  AbstractJobClusterExecutor, the job graph is
> > >> generated in
> > >> > > > the
> > >> > > > > > > >> execute
> > >> > > > > > > >> > function. We can still
> > >> > > > > > > >> > use the deployMode in it. With deployMode = cluster,
> the
> > >> > > execute
> > >> > > > > > > >> function
> > >> > > > > > > >> > only starts the cluster.
> > >> > > > > > > >> >
> > >> > > > > > > >> > When {Yarn/Kuberneates}PerJobClusterEntrypoint
> starts,
> > >> It will
> > >> > > > > start
> > >> > > > > > > the
> > >> > > > > > > >> > dispatch first, then we can use
> > >> > > > > > > >> > a ClusterEnvironment similar to ContextEnvironment to
> > >> submit
> > >> > > the
> > >> > > > > job
> > >> > > > > > > >> with
> > >> > > > > > > >> > jobName the local
> > >> > > > > > > >> > dispatcher. For the details, we need more
> investigation.
> > >> Let's
> > >> > > > > wait
> > >> > > > > > > >> > for @Aljoscha
> > >> > > > > > > >> > Krettek <[hidden email]> @Till Rohrmann <
> > >> > > > > [hidden email]
> > >> > > > > > >'s
> > >> > > > > > > >> > feedback after the holiday season.
> > >> > > > > > > >> >
> > >> > > > > > > >> > Thank you in advance. Merry Chrismas and Happy New
> > >> Year!!!
> > >> > > > > > > >> >
> > >> > > > > > > >> >
> > >> > > > > > > >> > Best Regards
> > >> > > > > > > >> > Peter Huang
> > >> > > > > > > >> >
> > >> > > > > > > >> >
> > >> > > > > > > >> >
> > >> > > > > > > >> >
> > >> > > > > > > >> >
> > >> > > > > > > >> >
> > >> > > > > > > >> >
> > >> > > > > > > >> >
> > >> > > > > > > >> > On Wed, Dec 25, 2019 at 1:08 AM Yang Wang <
> > >> > > > [hidden email]>
> > >> > > > > > > >> wrote:
> > >> > > > > > > >> >
> > >> > > > > > > >> >> Hi Peter,
> > >> > > > > > > >> >>
> > >> > > > > > > >> >> I think we need to reconsider tison's suggestion
> > >> seriously.
> > >> > > > After
> > >> > > > > > > >> FLIP-73,
> > >> > > > > > > >> >> the deployJobCluster has
> > >> > > > > > > >> >> beenmoved into `JobClusterExecutor#execute`. It
> should
> > >> not be
> > >> > > > > > > perceived
> > >> > > > > > > >> >> for `CliFrontend`. That
> > >> > > > > > > >> >> means the user program will *ALWAYS* be executed on
> > >> client
> > >> > > > side.
> > >> > > > > > This
> > >> > > > > > > >> is
> > >> > > > > > > >> >> the by design behavior.
> > >> > > > > > > >> >> So, we could not just add `if(client mode) .. else
> > >> if(cluster
> > >> > > > > mode)
> > >> > > > > > > >> ...`
> > >> > > > > > > >> >> codes in `CliFrontend` to bypass
> > >> > > > > > > >> >> the executor. We need to find a clean way to
> decouple
> > >> > > executing
> > >> > > > > > user
> > >> > > > > > > >> >> program and deploying per-job
> > >> > > > > > > >> >> cluster. Based on this, we could support to execute
> user
> > >> > > > program
> > >> > > > > on
> > >> > > > > > > >> client
> > >> > > > > > > >> >> or master side.
> > >> > > > > > > >> >>
> > >> > > > > > > >> >> Maybe Aljoscha and Jeff could give some good
> > >> suggestions.
> > >> > > > > > > >> >>
> > >> > > > > > > >> >>
> > >> > > > > > > >> >>
> > >> > > > > > > >> >> Best,
> > >> > > > > > > >> >> Yang
> > >> > > > > > > >> >>
> > >> > > > > > > >> >> Peter Huang <[hidden email]>
> 于2019年12月25日周三
> > >> > > > > 上午4:03写道:
> > >> > > > > > > >> >>
> > >> > > > > > > >> >>> Hi Jingjing,
> > >> > > > > > > >> >>>
> > >> > > > > > > >> >>> The improvement proposed is a deployment option for
> > >> CLI. For
> > >> > > > SQL
> > >> > > > > > > based
> > >> > > > > > > >> >>> Flink application, It is more convenient to use the
> > >> existing
> > >> > > > > model
> > >> > > > > > > in
> > >> > > > > > > >> >>> SqlClient in which
> > >> > > > > > > >> >>> the job graph is generated within SqlClient. After
> > >> adding
> > >> > > the
> > >> > > > > > > delayed
> > >> > > > > > > >> job
> > >> > > > > > > >> >>> graph generation, I think there is no change is
> needed
> > >> for
> > >> > > > your
> > >> > > > > > > side.
> > >> > > > > > > >> >>>
> > >> > > > > > > >> >>>
> > >> > > > > > > >> >>> Best Regards
> > >> > > > > > > >> >>> Peter Huang
> > >> > > > > > > >> >>>
> > >> > > > > > > >> >>>
> > >> > > > > > > >> >>> On Wed, Dec 18, 2019 at 6:01 AM jingjing bai <
> > >> > > > > > > >> [hidden email]>
> > >> > > > > > > >> >>> wrote:
> > >> > > > > > > >> >>>
> > >> > > > > > > >> >>>> hi peter:
> > >> > > > > > > >> >>>>    we had extension SqlClent to support sql job
> > >> submit in
> > >> > > web
> > >> > > > > > base
> > >> > > > > > > on
> > >> > > > > > > >> >>>> flink 1.9.   we support submit to yarn on per job
> > >> mode too.
> > >> > > > > > > >> >>>>    in this case, the job graph generated  on
> client
> > >> side
> > >> > > .  I
> > >> > > > > > think
> > >> > > > > > > >> >>> this
> > >> > > > > > > >> >>>> discuss Mainly to improve api programme.  but in
> my
> > >> case ,
> > >> > > > > there
> > >> > > > > > is
> > >> > > > > > > >> no
> > >> > > > > > > >> >>>> jar to upload but only a sql string .
> > >> > > > > > > >> >>>>    do u had more suggestion to improve for sql
> mode
> > >> or it
> > >> > > is
> > >> > > > > > only a
> > >> > > > > > > >> >>>> switch for api programme?
> > >> > > > > > > >> >>>>
> > >> > > > > > > >> >>>>
> > >> > > > > > > >> >>>> best
> > >> > > > > > > >> >>>> bai jj
> > >> > > > > > > >> >>>>
> > >> > > > > > > >> >>>>
> > >> > > > > > > >> >>>> Yang Wang <[hidden email]> 于2019年12月18日周三
> > >> 下午7:21写道:
> > >> > > > > > > >> >>>>
> > >> > > > > > > >> >>>>> I just want to revive this discussion.
> > >> > > > > > > >> >>>>>
> > >> > > > > > > >> >>>>> Recently, i am thinking about how to natively run
> > >> flink
> > >> > > > > per-job
> > >> > > > > > > >> >>> cluster on
> > >> > > > > > > >> >>>>> Kubernetes.
> > >> > > > > > > >> >>>>> The per-job mode on Kubernetes is very different
> > >> from on
> > >> > > > Yarn.
> > >> > > > > > And
> > >> > > > > > > >> we
> > >> > > > > > > >> >>> will
> > >> > > > > > > >> >>>>> have
> > >> > > > > > > >> >>>>> the same deployment requirements to the client
> and
> > >> entry
> > >> > > > > point.
> > >> > > > > > > >> >>>>>
> > >> > > > > > > >> >>>>> 1. Flink client not always need a local jar to
> start
> > >> a
> > >> > > Flink
> > >> > > > > > > per-job
> > >> > > > > > > >> >>>>> cluster. We could
> > >> > > > > > > >> >>>>> support multiple schemas. For example,
> > >> > > > file:///path/of/my.jar
> > >> > > > > > > means
> > >> > > > > > > >> a
> > >> > > > > > > >> >>> jar
> > >> > > > > > > >> >>>>> located
> > >> > > > > > > >> >>>>> at client side,
> > >> hdfs://myhdfs/user/myname/flink/my.jar
> > >> > > > means a
> > >> > > > > > jar
> > >> > > > > > > >> >>> located
> > >> > > > > > > >> >>>>> at
> > >> > > > > > > >> >>>>> remote hdfs, local:///path/in/image/my.jar means
> a
> > >> jar
> > >> > > > located
> > >> > > > > > at
> > >> > > > > > > >> >>>>> jobmanager side.
> > >> > > > > > > >> >>>>>
> > >> > > > > > > >> >>>>> 2. Support running user program on master side.
> This
> > >> also
> > >> > > > > means
> > >> > > > > > > the
> > >> > > > > > > >> >>> entry
> > >> > > > > > > >> >>>>> point
> > >> > > > > > > >> >>>>> will generate the job graph on master side. We
> could
> > >> use
> > >> > > the
> > >> > > > > > > >> >>>>> ClasspathJobGraphRetriever
> > >> > > > > > > >> >>>>> or start a local Flink client to achieve this
> > >> purpose.
> > >> > > > > > > >> >>>>>
> > >> > > > > > > >> >>>>>
> > >> > > > > > > >> >>>>> cc tison, Aljoscha & Kostas Do you think this is
> the
> > >> right
> > >> > > > > > > >> direction we
> > >> > > > > > > >> >>>>> need to work?
> > >> > > > > > > >> >>>>>
> > >> > > > > > > >> >>>>> tison <[hidden email]> 于2019年12月12日周四
> > >> 下午4:48写道:
> > >> > > > > > > >> >>>>>
> > >> > > > > > > >> >>>>>> A quick idea is that we separate the deployment
> > >> from user
> > >> > > > > > program
> > >> > > > > > > >> >>> that
> > >> > > > > > > >> >>>>> it
> > >> > > > > > > >> >>>>>> has always been done
> > >> > > > > > > >> >>>>>> outside the program. On user program executed
> there
> > >> is
> > >> > > > > always a
> > >> > > > > > > >> >>>>>> ClusterClient that communicates with
> > >> > > > > > > >> >>>>>> an existing cluster, remote or local. It will be
> > >> another
> > >> > > > > thread
> > >> > > > > > > so
> > >> > > > > > > >> >>> just
> > >> > > > > > > >> >>>>> for
> > >> > > > > > > >> >>>>>> your information.
> > >> > > > > > > >> >>>>>>
> > >> > > > > > > >> >>>>>> Best,
> > >> > > > > > > >> >>>>>> tison.
> > >> > > > > > > >> >>>>>>
> > >> > > > > > > >> >>>>>>
> > >> > > > > > > >> >>>>>> tison <[hidden email]> 于2019年12月12日周四
> > >> 下午4:40写道:
> > >> > > > > > > >> >>>>>>
> > >> > > > > > > >> >>>>>>> Hi Peter,
> > >> > > > > > > >> >>>>>>>
> > >> > > > > > > >> >>>>>>> Another concern I realized recently is that
> with
> > >> current
> > >> > > > > > > Executors
> > >> > > > > > > >> >>>>>>> abstraction(FLIP-73)
> > >> > > > > > > >> >>>>>>> I'm afraid that user program is designed to
> ALWAYS
> > >> run
> > >> > > on
> > >> > > > > the
> > >> > > > > > > >> >>> client
> > >> > > > > > > >> >>>>>> side.
> > >> > > > > > > >> >>>>>>> Specifically,
> > >> > > > > > > >> >>>>>>> we deploy the job in executor when env.execute
> > >> called.
> > >> > > > This
> > >> > > > > > > >> >>>>> abstraction
> > >> > > > > > > >> >>>>>>> possibly prevents
> > >> > > > > > > >> >>>>>>> Flink runs user program on the cluster side.
> > >> > > > > > > >> >>>>>>>
> > >> > > > > > > >> >>>>>>> For your proposal, in this case we already
> > >> compiled the
> > >> > > > > > program
> > >> > > > > > > >> and
> > >> > > > > > > >> >>>>> run
> > >> > > > > > > >> >>>>>> on
> > >> > > > > > > >> >>>>>>> the client side,
> > >> > > > > > > >> >>>>>>> even we deploy a cluster and retrieve job graph
> > >> from
> > >> > > > program
> > >> > > > > > > >> >>>>> metadata, it
> > >> > > > > > > >> >>>>>>> doesn't make
> > >> > > > > > > >> >>>>>>> many sense.
> > >> > > > > > > >> >>>>>>>
> > >> > > > > > > >> >>>>>>> cc Aljoscha & Kostas what do you think about
> this
> > >> > > > > constraint?
> > >> > > > > > > >> >>>>>>>
> > >> > > > > > > >> >>>>>>> Best,
> > >> > > > > > > >> >>>>>>> tison.
> > >> > > > > > > >> >>>>>>>
> > >> > > > > > > >> >>>>>>>
> > >> > > > > > > >> >>>>>>> Peter Huang <[hidden email]>
> > >> 于2019年12月10日周二
> > >> > > > > > > >> 下午12:45写道:
> > >> > > > > > > >> >>>>>>>
> > >> > > > > > > >> >>>>>>>> Hi Tison,
> > >> > > > > > > >> >>>>>>>>
> > >> > > > > > > >> >>>>>>>> Yes, you are right. I think I made the wrong
> > >> argument
> > >> > > in
> > >> > > > > the
> > >> > > > > > > doc.
> > >> > > > > > > >> >>>>>>>> Basically, the packaging jar problem is only
> for
> > >> > > platform
> > >> > > > > > > users.
> > >> > > > > > > >> >>> In
> > >> > > > > > > >> >>>>> our
> > >> > > > > > > >> >>>>>>>> internal deploy service,
> > >> > > > > > > >> >>>>>>>> we further optimized the deployment latency by
> > >> letting
> > >> > > > > users
> > >> > > > > > to
> > >> > > > > > > >> >>>>>> packaging
> > >> > > > > > > >> >>>>>>>> flink-runtime together with the uber jar, so
> that
> > >> we
> > >> > > > don't
> > >> > > > > > need
> > >> > > > > > > >> to
> > >> > > > > > > >> >>>>>>>> consider
> > >> > > > > > > >> >>>>>>>> multiple flink version
> > >> > > > > > > >> >>>>>>>> support for now. In the session client mode,
> as
> > >> Flink
> > >> > > > libs
> > >> > > > > > will
> > >> > > > > > > >> be
> > >> > > > > > > >> >>>>>> shipped
> > >> > > > > > > >> >>>>>>>> anyway as local resources of yarn. Users
> actually
> > >> don't
> > >> > > > > need
> > >> > > > > > to
> > >> > > > > > > >> >>>>> package
> > >> > > > > > > >> >>>>>>>> those libs into job jar.
> > >> > > > > > > >> >>>>>>>>
> > >> > > > > > > >> >>>>>>>>
> > >> > > > > > > >> >>>>>>>>
> > >> > > > > > > >> >>>>>>>> Best Regards
> > >> > > > > > > >> >>>>>>>> Peter Huang
> > >> > > > > > > >> >>>>>>>>
> > >> > > > > > > >> >>>>>>>> On Mon, Dec 9, 2019 at 8:35 PM tison <
> > >> > > > [hidden email]
> > >> > > > > >
> > >> > > > > > > >> >>> wrote:
> > >> > > > > > > >> >>>>>>>>
> > >> > > > > > > >> >>>>>>>>>> 3. What do you mean about the package? Do
> users
> > >> need
> > >> > > to
> > >> > > > > > > >> >>> compile
> > >> > > > > > > >> >>>>>> their
> > >> > > > > > > >> >>>>>>>>> jars
> > >> > > > > > > >> >>>>>>>>> inlcuding flink-clients, flink-optimizer,
> > >> flink-table
> > >> > > > > codes?
> > >> > > > > > > >> >>>>>>>>>
> > >> > > > > > > >> >>>>>>>>> The answer should be no because they exist in
> > >> system
> > >> > > > > > > classpath.
> > >> > > > > > > >> >>>>>>>>>
> > >> > > > > > > >> >>>>>>>>> Best,
> > >> > > > > > > >> >>>>>>>>> tison.
> > >> > > > > > > >> >>>>>>>>>
> > >> > > > > > > >> >>>>>>>>>
> > >> > > > > > > >> >>>>>>>>> Yang Wang <[hidden email]>
> 于2019年12月10日周二
> > >> > > > > 下午12:18写道:
> > >> > > > > > > >> >>>>>>>>>
> > >> > > > > > > >> >>>>>>>>>> Hi Peter,
> > >> > > > > > > >> >>>>>>>>>>
> > >> > > > > > > >> >>>>>>>>>> Thanks a lot for starting this discussion. I
> > >> think
> > >> > > this
> > >> > > > > is
> > >> > > > > > a
> > >> > > > > > > >> >>> very
> > >> > > > > > > >> >>>>>>>> useful
> > >> > > > > > > >> >>>>>>>>>> feature.
> > >> > > > > > > >> >>>>>>>>>>
> > >> > > > > > > >> >>>>>>>>>> Not only for Yarn, i am focused on flink on
> > >> > > Kubernetes
> > >> > > > > > > >> >>>>> integration
> > >> > > > > > > >> >>>>>> and
> > >> > > > > > > >> >>>>>>>>> come
> > >> > > > > > > >> >>>>>>>>>> across the same
> > >> > > > > > > >> >>>>>>>>>> problem. I do not want the job graph
> generated
> > >> on
> > >> > > > client
> > >> > > > > > > side.
> > >> > > > > > > >> >>>>>>>> Instead,
> > >> > > > > > > >> >>>>>>>>> the
> > >> > > > > > > >> >>>>>>>>>> user jars are built in
> > >> > > > > > > >> >>>>>>>>>> a user-defined image. When the job manager
> > >> launched,
> > >> > > we
> > >> > > > > > just
> > >> > > > > > > >> >>>>> need to
> > >> > > > > > > >> >>>>>>>>>> generate the job graph
> > >> > > > > > > >> >>>>>>>>>> based on local user jars.
> > >> > > > > > > >> >>>>>>>>>>
> > >> > > > > > > >> >>>>>>>>>> I have some small suggestion about this.
> > >> > > > > > > >> >>>>>>>>>>
> > >> > > > > > > >> >>>>>>>>>> 1. `ProgramJobGraphRetriever` is very
> similar to
> > >> > > > > > > >> >>>>>>>>>> `ClasspathJobGraphRetriever`, the
> differences
> > >> > > > > > > >> >>>>>>>>>> are the former needs `ProgramMetadata` and
> the
> > >> latter
> > >> > > > > needs
> > >> > > > > > > >> >>> some
> > >> > > > > > > >> >>>>>>>>> arguments.
> > >> > > > > > > >> >>>>>>>>>> Is it possible to
> > >> > > > > > > >> >>>>>>>>>> have an unified `JobGraphRetriever` to
> support
> > >> both?
> > >> > > > > > > >> >>>>>>>>>> 2. Is it possible to not use a local user
> jar to
> > >> > > start
> > >> > > > a
> > >> > > > > > > >> >>> per-job
> > >> > > > > > > >> >>>>>>>> cluster?
> > >> > > > > > > >> >>>>>>>>>> In your case, the user jars has
> > >> > > > > > > >> >>>>>>>>>> existed on hdfs already and we do need to
> > >> download
> > >> > > the
> > >> > > > > jars
> > >> > > > > > > to
> > >> > > > > > > >> >>>>>>>> deployer
> > >> > > > > > > >> >>>>>>>>>> service. Currently, we
> > >> > > > > > > >> >>>>>>>>>> always need a local user jar to start a
> flink
> > >> > > cluster.
> > >> > > > It
> > >> > > > > > is
> > >> > > > > > > >> >>> be
> > >> > > > > > > >> >>>>>> great
> > >> > > > > > > >> >>>>>>>> if
> > >> > > > > > > >> >>>>>>>>> we
> > >> > > > > > > >> >>>>>>>>>> could support remote user jars.
> > >> > > > > > > >> >>>>>>>>>>>> In the implementation, we assume users
> package
> > >> > > > > > > >> >>> flink-clients,
> > >> > > > > > > >> >>>>>>>>>> flink-optimizer, flink-table together within
> > >> the job
> > >> > > > jar.
> > >> > > > > > > >> >>>>> Otherwise,
> > >> > > > > > > >> >>>>>>>> the
> > >> > > > > > > >> >>>>>>>>>> job graph generation within
> > >> JobClusterEntryPoint will
> > >> > > > > fail.
> > >> > > > > > > >> >>>>>>>>>> 3. What do you mean about the package? Do
> users
> > >> need
> > >> > > to
> > >> > > > > > > >> >>> compile
> > >> > > > > > > >> >>>>>> their
> > >> > > > > > > >> >>>>>>>>> jars
> > >> > > > > > > >> >>>>>>>>>> inlcuding flink-clients, flink-optimizer,
> > >> flink-table
> > >> > > > > > codes?
> > >> > > > > > > >> >>>>>>>>>>
> > >> > > > > > > >> >>>>>>>>>>
> > >> > > > > > > >> >>>>>>>>>>
> > >> > > > > > > >> >>>>>>>>>> Best,
> > >> > > > > > > >> >>>>>>>>>> Yang
> > >> > > > > > > >> >>>>>>>>>>
> > >> > > > > > > >> >>>>>>>>>> Peter Huang <[hidden email]>
> > >> > > > 于2019年12月10日周二
> > >> > > > > > > >> >>>>> 上午2:37写道:
> > >> > > > > > > >> >>>>>>>>>>
> > >> > > > > > > >> >>>>>>>>>>> Dear All,
> > >> > > > > > > >> >>>>>>>>>>>
> > >> > > > > > > >> >>>>>>>>>>> Recently, the Flink community starts to
> > >> improve the
> > >> > > > yarn
> > >> > > > > > > >> >>>>> cluster
> > >> > > > > > > >> >>>>>>>>>> descriptor
> > >> > > > > > > >> >>>>>>>>>>> to make job jar and config files
> configurable
> > >> from
> > >> > > > CLI.
> > >> > > > > It
> > >> > > > > > > >> >>>>>> improves
> > >> > > > > > > >> >>>>>>>> the
> > >> > > > > > > >> >>>>>>>>>>> flexibility of  Flink deployment Yarn Per
> Job
> > >> Mode.
> > >> > > > For
> > >> > > > > > > >> >>>>> platform
> > >> > > > > > > >> >>>>>>>> users
> > >> > > > > > > >> >>>>>>>>>> who
> > >> > > > > > > >> >>>>>>>>>>> manage tens of hundreds of streaming
> pipelines
> > >> for
> > >> > > the
> > >> > > > > > whole
> > >> > > > > > > >> >>>>> org
> > >> > > > > > > >> >>>>>> or
> > >> > > > > > > >> >>>>>>>>>>> company, we found the job graph generation
> in
> > >> > > > > client-side
> > >> > > > > > is
> > >> > > > > > > >> >>>>>> another
> > >> > > > > > > >> >>>>>>>>>>> pinpoint. Thus, we want to propose a
> > >> configurable
> > >> > > > > feature
> > >> > > > > > > >> >>> for
> > >> > > > > > > >> >>>>>>>>>>> FlinkYarnSessionCli. The feature can allow
> > >> users to
> > >> > > > > choose
> > >> > > > > > > >> >>> the
> > >> > > > > > > >> >>>>> job
> > >> > > > > > > >> >>>>>>>>> graph
> > >> > > > > > > >> >>>>>>>>>>> generation in Flink ClusterEntryPoint so
> that
> > >> the
> > >> > > job
> > >> > > > > jar
> > >> > > > > > > >> >>>>> doesn't
> > >> > > > > > > >> >>>>>>>> need
> > >> > > > > > > >> >>>>>>>>> to
> > >> > > > > > > >> >>>>>>>>>>> be locally for the job graph generation.
> The
> > >> > > proposal
> > >> > > > is
> > >> > > > > > > >> >>>>> organized
> > >> > > > > > > >> >>>>>>>> as a
> > >> > > > > > > >> >>>>>>>>>>> FLIP
> > >> > > > > > > >> >>>>>>>>>>>
> > >> > > > > > > >> >>>>>>>>>>>
> > >> > > > > > > >> >>>>>>>>>>
> > >> > > > > > > >> >>>>>>>>>
> > >> > > > > > > >> >>>>>>>>
> > >> > > > > > > >> >>>>>>
> > >> > > > > > > >> >>>>>
> > >> > > > > > > >> >>>
> > >> > > > > > > >>
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-85+Delayed+JobGraph+Generation
> > >> > > > > > > >> >>>>>>>>>>> .
> > >> > > > > > > >> >>>>>>>>>>>
> > >> > > > > > > >> >>>>>>>>>>> Any questions and suggestions are welcomed.
> > >> Thank
> > >> > > you
> > >> > > > in
> > >> > > > > > > >> >>>>> advance.
> > >> > > > > > > >> >>>>>>>>>>>
> > >> > > > > > > >> >>>>>>>>>>>
> > >> > > > > > > >> >>>>>>>>>>> Best Regards
> > >> > > > > > > >> >>>>>>>>>>> Peter Huang
> > >> > > > > > > >> >>>>>>>>>>>
> > >> > > > > > > >> >>>>>>>>>>
> > >> > > > > > > >> >>>>>>>>>
> > >> > > > > > > >> >>>>>>>>
> > >> > > > > > > >> >>>>>>>
> > >> > > > > > > >> >>>>>>
> > >> > > > > > > >> >>>>>
> > >> > > > > > > >> >>>>
> > >> > > > > > > >> >>>
> > >> > > > > > > >> >>
> > >> > > > > > > >>
> > >> > > > > > > >>
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >>
> > >
>
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-85: Delayed Job Graph Generation

Kostas Kloudas-4
Hi Yang,

The difference between per-job and application mode is that, as you
described, in the per-job mode the main is executed on the client
while in the application mode, the main is executed on the cluster.
I do not think we have to offer "application mode" with running the
main on the client side as this is exactly what the per-job mode does
currently and, as you described also, it would be redundant.

Sorry if this was not clear in the document.

Cheers,
Kostas

On Mon, Mar 2, 2020 at 3:17 PM Yang Wang <[hidden email]> wrote:

>
> Hi Kostas,
>
> Thanks a lot for your conclusion and updating the FLIP-85 WIKI. Currently, i have no more
> questions about motivation, approach, fault tolerance and the first phase implementation.
>
> I think the new title "Flink Application Mode" makes a lot senses to me. Especially for the
> containerized environment, the cluster deploy option will be very useful.
>
> Just one concern, how do we introduce this new application mode to our users?
> Each user program(i.e. `main()`) is an application. Currently, we intend to only support one
> `execute()`. So what's the difference between per-job and application mode?
>
> For per-job, user `main()` is always executed on client side. And For application mode, user
> `main()` could be executed on client or master side(configured via cli option).
> Right? We need to have a clear concept. Otherwise, the users will be more and more confusing.
>
>
> Best,
> Yang
>
> Kostas Kloudas <[hidden email]> 于2020年3月2日周一 下午5:58写道:
>>
>> Hi all,
>>
>> I update https://cwiki.apache.org/confluence/display/FLINK/FLIP-85+Flink+Application+Mode
>> based on the discussion we had here:
>>
>> https://docs.google.com/document/d/1ji72s3FD9DYUyGuKnJoO4ApzV-nSsZa0-bceGXW7Ocw/edit#
>>
>> Please let me know what you think and please keep the discussion in the ML :)
>>
>> Thanks for starting the discussion and I hope that soon we will be
>> able to vote on the FLIP.
>>
>> Cheers,
>> Kostas
>>
>> On Thu, Jan 16, 2020 at 3:40 AM Yang Wang <[hidden email]> wrote:
>> >
>> > Hi all,
>> >
>> > Thanks a lot for the feedback from @Kostas Kloudas. Your all concerns are
>> > on point. The FLIP-85 is mainly
>> > focused on supporting cluster mode for per-job. Since it is more urgent and
>> > have much more use
>> > cases both in Yarn and Kubernetes deployment. For session cluster, we could
>> > have more discussion
>> > in a new thread later.
>> >
>> > #1, How to download the user jars and dependencies for per-job in cluster
>> > mode?
>> > For Yarn, we could register the user jars and dependencies as
>> > LocalResource. They will be distributed
>> > by Yarn. And once the JobManager and TaskManager launched, the jars are
>> > already exists.
>> > For Standalone per-job and K8s, we expect that the user jars
>> > and dependencies are built into the image.
>> > Or the InitContainer could be used for downloading. It is natively
>> > distributed and we will not have bottleneck.
>> >
>> > #2, Job graph recovery
>> > We could have an optimization to store job graph on the DFS. However, i
>> > suggest building a new jobgraph
>> > from the configuration is the default option. Since we will not always have
>> > a DFS store when deploying a
>> > Flink per-job cluster. Of course, we assume that using the same
>> > configuration(e.g. job_id, user_jar, main_class,
>> > main_args, parallelism, savepoint_settings, etc.) will get a same job
>> > graph. I think the standalone per-job
>> > already has the similar behavior.
>> >
>> > #3, What happens with jobs that have multiple execute calls?
>> > Currently, it is really a problem. Even we use a local client on Flink
>> > master side, it will have different behavior with
>> > client mode. For client mode, if we execute multiple times, then we will
>> > deploy multiple Flink clusters for each execute.
>> > I am not pretty sure whether it is reasonable. However, i still think using
>> > the local client is a good choice. We could
>> > continue the discussion in a new thread. @Zili Chen <[hidden email]> Do
>> > you want to drive this?
>> >
>> >
>> >
>> > Best,
>> > Yang
>> >
>> > Peter Huang <[hidden email]> 于2020年1月16日周四 上午1:55写道:
>> >
>> > > Hi Kostas,
>> > >
>> > > Thanks for this feedback. I can't agree more about the opinion. The
>> > > cluster mode should be added
>> > > first in per job cluster.
>> > >
>> > > 1) For job cluster implementation
>> > > 1. Job graph recovery from configuration or store as static job graph as
>> > > session cluster. I think the static one will be better for less recovery
>> > > time.
>> > > Let me update the doc for details.
>> > >
>> > > 2. For job execute multiple times, I think @Zili Chen
>> > > <[hidden email]> has proposed the local client solution that can
>> > > the run program actually in the cluster entry point. We can put the
>> > > implementation in the second stage,
>> > > or even a new FLIP for further discussion.
>> > >
>> > > 2) For session cluster implementation
>> > > We can disable the cluster mode for the session cluster in the first
>> > > stage. I agree the jar downloading will be a painful thing.
>> > > We can consider about PoC and performance evaluation first. If the end to
>> > > end experience is good enough, then we can consider
>> > > proceeding with the solution.
>> > >
>> > > Looking forward to more opinions from @Yang Wang <[hidden email]> @Zili
>> > > Chen <[hidden email]> @Dian Fu <[hidden email]>.
>> > >
>> > >
>> > > Best Regards
>> > > Peter Huang
>> > >
>> > > On Wed, Jan 15, 2020 at 7:50 AM Kostas Kloudas <[hidden email]> wrote:
>> > >
>> > >> Hi all,
>> > >>
>> > >> I am writing here as the discussion on the Google Doc seems to be a
>> > >> bit difficult to follow.
>> > >>
>> > >> I think that in order to be able to make progress, it would be helpful
>> > >> to focus on per-job mode for now.
>> > >> The reason is that:
>> > >>  1) making the (unique) JobSubmitHandler responsible for creating the
>> > >> jobgraphs,
>> > >>   which includes downloading dependencies, is not an optimal solution
>> > >>  2) even if we put the responsibility on the JobMaster, currently each
>> > >> job has its own
>> > >>   JobMaster but they all run on the same process, so we have again a
>> > >> single entity.
>> > >>
>> > >> Of course after this is done, and if we feel comfortable with the
>> > >> solution, then we can go to the session mode.
>> > >>
>> > >> A second comment has to do with fault-tolerance in the per-job,
>> > >> cluster-deploy mode.
>> > >> In the document, it is suggested that upon recovery, the JobMaster of
>> > >> each job re-creates the JobGraph.
>> > >> I am just wondering if it is better to create and store the jobGraph
>> > >> upon submission and only fetch it
>> > >> upon recovery so that we have a static jobGraph.
>> > >>
>> > >> Finally, I have a question which is what happens with jobs that have
>> > >> multiple execute calls?
>> > >> The semantics seem to change compared to the current behaviour, right?
>> > >>
>> > >> Cheers,
>> > >> Kostas
>> > >>
>> > >> On Wed, Jan 8, 2020 at 8:05 PM tison <[hidden email]> wrote:
>> > >> >
>> > >> > not always, Yang Wang is also not yet a committer but he can join the
>> > >> > channel. I cannot find the id by clicking “Add new member in channel” so
>> > >> > come to you and ask for try out the link. Possibly I will find other
>> > >> ways
>> > >> > but the original purpose is that the slack channel is a public area we
>> > >> > discuss about developing...
>> > >> > Best,
>> > >> > tison.
>> > >> >
>> > >> >
>> > >> > Peter Huang <[hidden email]> 于2020年1月9日周四 上午2:44写道:
>> > >> >
>> > >> > > Hi Tison,
>> > >> > >
>> > >> > > I am not the committer of Flink yet. I think I can't join it also.
>> > >> > >
>> > >> > >
>> > >> > > Best Regards
>> > >> > > Peter Huang
>> > >> > >
>> > >> > > On Wed, Jan 8, 2020 at 9:39 AM tison <[hidden email]> wrote:
>> > >> > >
>> > >> > > > Hi Peter,
>> > >> > > >
>> > >> > > > Could you try out this link?
>> > >> > > https://the-asf.slack.com/messages/CNA3ADZPH
>> > >> > > >
>> > >> > > > Best,
>> > >> > > > tison.
>> > >> > > >
>> > >> > > >
>> > >> > > > Peter Huang <[hidden email]> 于2020年1月9日周四 上午1:22写道:
>> > >> > > >
>> > >> > > > > Hi Tison,
>> > >> > > > >
>> > >> > > > > I can't join the group with shared link. Would you please add me
>> > >> into
>> > >> > > the
>> > >> > > > > group? My slack account is huangzhenqiu0825.
>> > >> > > > > Thank you in advance.
>> > >> > > > >
>> > >> > > > >
>> > >> > > > > Best Regards
>> > >> > > > > Peter Huang
>> > >> > > > >
>> > >> > > > > On Wed, Jan 8, 2020 at 12:02 AM tison <[hidden email]>
>> > >> wrote:
>> > >> > > > >
>> > >> > > > > > Hi Peter,
>> > >> > > > > >
>> > >> > > > > > As described above, this effort should get attention from people
>> > >> > > > > developing
>> > >> > > > > > FLIP-73 a.k.a. Executor abstractions. I recommend you to join
>> > >> the
>> > >> > > > public
>> > >> > > > > > slack channel[1] for Flink Client API Enhancement and you can
>> > >> try to
>> > >> > > > > share
>> > >> > > > > > you detailed thoughts there. It possibly gets more concrete
>> > >> > > attentions.
>> > >> > > > > >
>> > >> > > > > > Best,
>> > >> > > > > > tison.
>> > >> > > > > >
>> > >> > > > > > [1]
>> > >> > > > > >
>> > >> > > > > >
>> > >> > > > >
>> > >> > > >
>> > >> > >
>> > >> https://slack.com/share/IS21SJ75H/Rk8HhUly9FuEHb7oGwBZ33uL/enQtODg2MDYwNjE5MTg3LTA2MjIzNDc1M2ZjZDVlMjdlZjk1M2RkYmJhNjAwMTk2ZDZkODQ4NmY5YmI4OGRhNWJkYTViMTM1NzlmMzc4OWM
>> > >> > > > > >
>> > >> > > > > >
>> > >> > > > > > Peter Huang <[hidden email]> 于2020年1月7日周二 上午5:09写道:
>> > >> > > > > >
>> > >> > > > > > > Dear All,
>> > >> > > > > > >
>> > >> > > > > > > Happy new year! According to existing feedback from the
>> > >> community,
>> > >> > > we
>> > >> > > > > > > revised the doc with the consideration of session cluster
>> > >> support,
>> > >> > > > and
>> > >> > > > > > > concrete interface changes needed and execution plan. Please
>> > >> take
>> > >> > > one
>> > >> > > > > > more
>> > >> > > > > > > round of review at your most convenient time.
>> > >> > > > > > >
>> > >> > > > > > >
>> > >> > > > > > >
>> > >> > > > > >
>> > >> > > > >
>> > >> > > >
>> > >> > >
>> > >> https://docs.google.com/document/d/1aAwVjdZByA-0CHbgv16Me-vjaaDMCfhX7TzVVTuifYM/edit#
>> > >> > > > > > >
>> > >> > > > > > >
>> > >> > > > > > > Best Regards
>> > >> > > > > > > Peter Huang
>> > >> > > > > > >
>> > >> > > > > > >
>> > >> > > > > > >
>> > >> > > > > > >
>> > >> > > > > > >
>> > >> > > > > > > On Thu, Jan 2, 2020 at 11:29 AM Peter Huang <
>> > >> > > > > [hidden email]>
>> > >> > > > > > > wrote:
>> > >> > > > > > >
>> > >> > > > > > > > Hi Dian,
>> > >> > > > > > > > Thanks for giving us valuable feedbacks.
>> > >> > > > > > > >
>> > >> > > > > > > > 1) It's better to have a whole design for this feature
>> > >> > > > > > > > For the suggestion of enabling the cluster mode also session
>> > >> > > > > cluster, I
>> > >> > > > > > > > think Flink already supported it. WebSubmissionExtension
>> > >> already
>> > >> > > > > allows
>> > >> > > > > > > > users to start a job with the specified jar by using web UI.
>> > >> > > > > > > > But we need to enable the feature from CLI for both local
>> > >> jar,
>> > >> > > > remote
>> > >> > > > > > > jar.
>> > >> > > > > > > > I will align with Yang Wang first about the details and
>> > >> update
>> > >> > > the
>> > >> > > > > > design
>> > >> > > > > > > > doc.
>> > >> > > > > > > >
>> > >> > > > > > > > 2) It's better to consider the convenience for users, such
>> > >> as
>> > >> > > > > debugging
>> > >> > > > > > > >
>> > >> > > > > > > > I am wondering whether we can store the exception in
>> > >> jobgragh
>> > >> > > > > > > > generation in application master. As no streaming graph can
>> > >> be
>> > >> > > > > > scheduled
>> > >> > > > > > > in
>> > >> > > > > > > > this case, there will be no more TM will be requested from
>> > >> > > FlinkRM.
>> > >> > > > > > > > If the AM is still running, users can still query it from
>> > >> CLI. As
>> > >> > > > it
>> > >> > > > > > > > requires more change, we can get some feedback from <
>> > >> > > > > > [hidden email]
>> > >> > > > > > > >
>> > >> > > > > > > > and @[hidden email] <[hidden email]>.
>> > >> > > > > > > >
>> > >> > > > > > > > 3) It's better to consider the impact to the stability of
>> > >> the
>> > >> > > > cluster
>> > >> > > > > > > >
>> > >> > > > > > > > I agree with Yang Wang's opinion.
>> > >> > > > > > > >
>> > >> > > > > > > >
>> > >> > > > > > > >
>> > >> > > > > > > > Best Regards
>> > >> > > > > > > > Peter Huang
>> > >> > > > > > > >
>> > >> > > > > > > >
>> > >> > > > > > > > On Sun, Dec 29, 2019 at 9:44 PM Dian Fu <
>> > >> [hidden email]>
>> > >> > > > > wrote:
>> > >> > > > > > > >
>> > >> > > > > > > >> Hi all,
>> > >> > > > > > > >>
>> > >> > > > > > > >> Sorry to jump into this discussion. Thanks everyone for the
>> > >> > > > > > discussion.
>> > >> > > > > > > >> I'm very interested in this topic although I'm not an
>> > >> expert in
>> > >> > > > this
>> > >> > > > > > > part.
>> > >> > > > > > > >> So I'm glad to share my thoughts as following:
>> > >> > > > > > > >>
>> > >> > > > > > > >> 1) It's better to have a whole design for this feature
>> > >> > > > > > > >> As we know, there are two deployment modes: per-job mode
>> > >> and
>> > >> > > > session
>> > >> > > > > > > >> mode. I'm wondering which mode really needs this feature.
>> > >> As the
>> > >> > > > > > design
>> > >> > > > > > > doc
>> > >> > > > > > > >> mentioned, per-job mode is more used for streaming jobs and
>> > >> > > > session
>> > >> > > > > > > mode is
>> > >> > > > > > > >> usually used for batch jobs(Of course, the job types and
>> > >> the
>> > >> > > > > > deployment
>> > >> > > > > > > >> modes are orthogonal). Usually streaming job is only
>> > >> needed to
>> > >> > > be
>> > >> > > > > > > submitted
>> > >> > > > > > > >> once and it will run for days or weeks, while batch jobs
>> > >> will be
>> > >> > > > > > > submitted
>> > >> > > > > > > >> more frequently compared with streaming jobs. This means
>> > >> that
>> > >> > > > maybe
>> > >> > > > > > > session
>> > >> > > > > > > >> mode also needs this feature. However, if we support this
>> > >> > > feature
>> > >> > > > in
>> > >> > > > > > > >> session mode, the application master will become the new
>> > >> > > > centralized
>> > >> > > > > > > >> service(which should be solved). So in this case, it's
>> > >> better to
>> > >> > > > > have
>> > >> > > > > > a
>> > >> > > > > > > >> complete design for both per-job mode and session mode.
>> > >> > > > Furthermore,
>> > >> > > > > > > even
>> > >> > > > > > > >> if we can do it phase by phase, we need to have a whole
>> > >> picture
>> > >> > > of
>> > >> > > > > how
>> > >> > > > > > > it
>> > >> > > > > > > >> works in both per-job mode and session mode.
>> > >> > > > > > > >>
>> > >> > > > > > > >> 2) It's better to consider the convenience for users, such
>> > >> as
>> > >> > > > > > debugging
>> > >> > > > > > > >> After we finish this feature, the job graph will be
>> > >> compiled in
>> > >> > > > the
>> > >> > > > > > > >> application master, which means that users cannot easily
>> > >> get the
>> > >> > > > > > > exception
>> > >> > > > > > > >> message synchorousely in the job client if there are
>> > >> problems
>> > >> > > > during
>> > >> > > > > > the
>> > >> > > > > > > >> job graph compiling (especially for platform users), such
>> > >> as the
>> > >> > > > > > > resource
>> > >> > > > > > > >> path is incorrect, the user program itself has some
>> > >> problems,
>> > >> > > etc.
>> > >> > > > > > What
>> > >> > > > > > > I'm
>> > >> > > > > > > >> thinking is that maybe we should throw the exceptions as
>> > >> early
>> > >> > > as
>> > >> > > > > > > possible
>> > >> > > > > > > >> (during job submission stage).
>> > >> > > > > > > >>
>> > >> > > > > > > >> 3) It's better to consider the impact to the stability of
>> > >> the
>> > >> > > > > cluster
>> > >> > > > > > > >> If we perform the compiling in the application master, we
>> > >> should
>> > >> > > > > > > consider
>> > >> > > > > > > >> the impact of the compiling errors. Although YARN could
>> > >> resume
>> > >> > > the
>> > >> > > > > > > >> application master in case of failures, but in some case
>> > >> the
>> > >> > > > > compiling
>> > >> > > > > > > >> failure may be a waste of cluster resource and may impact
>> > >> the
>> > >> > > > > > stability
>> > >> > > > > > > the
>> > >> > > > > > > >> cluster and the other jobs in the cluster, such as the
>> > >> resource
>> > >> > > > path
>> > >> > > > > > is
>> > >> > > > > > > >> incorrect, the user program itself has some problems(in
>> > >> this
>> > >> > > case,
>> > >> > > > > job
>> > >> > > > > > > >> failover cannot solve this kind of problems) etc. In the
>> > >> current
>> > >> > > > > > > >> implemention, the compiling errors are handled in the
>> > >> client
>> > >> > > side
>> > >> > > > > and
>> > >> > > > > > > there
>> > >> > > > > > > >> is no impact to the cluster at all.
>> > >> > > > > > > >>
>> > >> > > > > > > >> Regarding to 1), it's clearly pointed in the design doc
>> > >> that
>> > >> > > only
>> > >> > > > > > > per-job
>> > >> > > > > > > >> mode will be supported. However, I think it's better to
>> > >> also
>> > >> > > > > consider
>> > >> > > > > > > the
>> > >> > > > > > > >> session mode in the design doc.
>> > >> > > > > > > >> Regarding to 2) and 3), I have not seen related sections
>> > >> in the
>> > >> > > > > design
>> > >> > > > > > > >> doc. It will be good if we can cover them in the design
>> > >> doc.
>> > >> > > > > > > >>
>> > >> > > > > > > >> Feel free to correct me If there is anything I
>> > >> misunderstand.
>> > >> > > > > > > >>
>> > >> > > > > > > >> Regards,
>> > >> > > > > > > >> Dian
>> > >> > > > > > > >>
>> > >> > > > > > > >>
>> > >> > > > > > > >> > 在 2019年12月27日,上午3:13,Peter Huang <
>> > >> [hidden email]>
>> > >> > > > 写道:
>> > >> > > > > > > >> >
>> > >> > > > > > > >> > Hi Yang,
>> > >> > > > > > > >> >
>> > >> > > > > > > >> > I can't agree more. The effort definitely needs to align
>> > >> with
>> > >> > > > the
>> > >> > > > > > > final
>> > >> > > > > > > >> > goal of FLIP-73.
>> > >> > > > > > > >> > I am thinking about whether we can achieve the goal with
>> > >> two
>> > >> > > > > phases.
>> > >> > > > > > > >> >
>> > >> > > > > > > >> > 1) Phase I
>> > >> > > > > > > >> > As the CLiFrontend will not be depreciated soon. We can
>> > >> still
>> > >> > > > use
>> > >> > > > > > the
>> > >> > > > > > > >> > deployMode flag there,
>> > >> > > > > > > >> > pass the program info through Flink configuration,  use
>> > >> the
>> > >> > > > > > > >> > ClassPathJobGraphRetriever
>> > >> > > > > > > >> > to generate the job graph in ClusterEntrypoints of yarn
>> > >> and
>> > >> > > > > > > Kubernetes.
>> > >> > > > > > > >> >
>> > >> > > > > > > >> > 2) Phase II
>> > >> > > > > > > >> > In  AbstractJobClusterExecutor, the job graph is
>> > >> generated in
>> > >> > > > the
>> > >> > > > > > > >> execute
>> > >> > > > > > > >> > function. We can still
>> > >> > > > > > > >> > use the deployMode in it. With deployMode = cluster, the
>> > >> > > execute
>> > >> > > > > > > >> function
>> > >> > > > > > > >> > only starts the cluster.
>> > >> > > > > > > >> >
>> > >> > > > > > > >> > When {Yarn/Kuberneates}PerJobClusterEntrypoint starts,
>> > >> It will
>> > >> > > > > start
>> > >> > > > > > > the
>> > >> > > > > > > >> > dispatch first, then we can use
>> > >> > > > > > > >> > a ClusterEnvironment similar to ContextEnvironment to
>> > >> submit
>> > >> > > the
>> > >> > > > > job
>> > >> > > > > > > >> with
>> > >> > > > > > > >> > jobName the local
>> > >> > > > > > > >> > dispatcher. For the details, we need more investigation.
>> > >> Let's
>> > >> > > > > wait
>> > >> > > > > > > >> > for @Aljoscha
>> > >> > > > > > > >> > Krettek <[hidden email]> @Till Rohrmann <
>> > >> > > > > [hidden email]
>> > >> > > > > > >'s
>> > >> > > > > > > >> > feedback after the holiday season.
>> > >> > > > > > > >> >
>> > >> > > > > > > >> > Thank you in advance. Merry Chrismas and Happy New
>> > >> Year!!!
>> > >> > > > > > > >> >
>> > >> > > > > > > >> >
>> > >> > > > > > > >> > Best Regards
>> > >> > > > > > > >> > Peter Huang
>> > >> > > > > > > >> >
>> > >> > > > > > > >> >
>> > >> > > > > > > >> >
>> > >> > > > > > > >> >
>> > >> > > > > > > >> >
>> > >> > > > > > > >> >
>> > >> > > > > > > >> >
>> > >> > > > > > > >> >
>> > >> > > > > > > >> > On Wed, Dec 25, 2019 at 1:08 AM Yang Wang <
>> > >> > > > [hidden email]>
>> > >> > > > > > > >> wrote:
>> > >> > > > > > > >> >
>> > >> > > > > > > >> >> Hi Peter,
>> > >> > > > > > > >> >>
>> > >> > > > > > > >> >> I think we need to reconsider tison's suggestion
>> > >> seriously.
>> > >> > > > After
>> > >> > > > > > > >> FLIP-73,
>> > >> > > > > > > >> >> the deployJobCluster has
>> > >> > > > > > > >> >> beenmoved into `JobClusterExecutor#execute`. It should
>> > >> not be
>> > >> > > > > > > perceived
>> > >> > > > > > > >> >> for `CliFrontend`. That
>> > >> > > > > > > >> >> means the user program will *ALWAYS* be executed on
>> > >> client
>> > >> > > > side.
>> > >> > > > > > This
>> > >> > > > > > > >> is
>> > >> > > > > > > >> >> the by design behavior.
>> > >> > > > > > > >> >> So, we could not just add `if(client mode) .. else
>> > >> if(cluster
>> > >> > > > > mode)
>> > >> > > > > > > >> ...`
>> > >> > > > > > > >> >> codes in `CliFrontend` to bypass
>> > >> > > > > > > >> >> the executor. We need to find a clean way to decouple
>> > >> > > executing
>> > >> > > > > > user
>> > >> > > > > > > >> >> program and deploying per-job
>> > >> > > > > > > >> >> cluster. Based on this, we could support to execute user
>> > >> > > > program
>> > >> > > > > on
>> > >> > > > > > > >> client
>> > >> > > > > > > >> >> or master side.
>> > >> > > > > > > >> >>
>> > >> > > > > > > >> >> Maybe Aljoscha and Jeff could give some good
>> > >> suggestions.
>> > >> > > > > > > >> >>
>> > >> > > > > > > >> >>
>> > >> > > > > > > >> >>
>> > >> > > > > > > >> >> Best,
>> > >> > > > > > > >> >> Yang
>> > >> > > > > > > >> >>
>> > >> > > > > > > >> >> Peter Huang <[hidden email]> 于2019年12月25日周三
>> > >> > > > > 上午4:03写道:
>> > >> > > > > > > >> >>
>> > >> > > > > > > >> >>> Hi Jingjing,
>> > >> > > > > > > >> >>>
>> > >> > > > > > > >> >>> The improvement proposed is a deployment option for
>> > >> CLI. For
>> > >> > > > SQL
>> > >> > > > > > > based
>> > >> > > > > > > >> >>> Flink application, It is more convenient to use the
>> > >> existing
>> > >> > > > > model
>> > >> > > > > > > in
>> > >> > > > > > > >> >>> SqlClient in which
>> > >> > > > > > > >> >>> the job graph is generated within SqlClient. After
>> > >> adding
>> > >> > > the
>> > >> > > > > > > delayed
>> > >> > > > > > > >> job
>> > >> > > > > > > >> >>> graph generation, I think there is no change is needed
>> > >> for
>> > >> > > > your
>> > >> > > > > > > side.
>> > >> > > > > > > >> >>>
>> > >> > > > > > > >> >>>
>> > >> > > > > > > >> >>> Best Regards
>> > >> > > > > > > >> >>> Peter Huang
>> > >> > > > > > > >> >>>
>> > >> > > > > > > >> >>>
>> > >> > > > > > > >> >>> On Wed, Dec 18, 2019 at 6:01 AM jingjing bai <
>> > >> > > > > > > >> [hidden email]>
>> > >> > > > > > > >> >>> wrote:
>> > >> > > > > > > >> >>>
>> > >> > > > > > > >> >>>> hi peter:
>> > >> > > > > > > >> >>>>    we had extension SqlClent to support sql job
>> > >> submit in
>> > >> > > web
>> > >> > > > > > base
>> > >> > > > > > > on
>> > >> > > > > > > >> >>>> flink 1.9.   we support submit to yarn on per job
>> > >> mode too.
>> > >> > > > > > > >> >>>>    in this case, the job graph generated  on client
>> > >> side
>> > >> > > .  I
>> > >> > > > > > think
>> > >> > > > > > > >> >>> this
>> > >> > > > > > > >> >>>> discuss Mainly to improve api programme.  but in my
>> > >> case ,
>> > >> > > > > there
>> > >> > > > > > is
>> > >> > > > > > > >> no
>> > >> > > > > > > >> >>>> jar to upload but only a sql string .
>> > >> > > > > > > >> >>>>    do u had more suggestion to improve for sql mode
>> > >> or it
>> > >> > > is
>> > >> > > > > > only a
>> > >> > > > > > > >> >>>> switch for api programme?
>> > >> > > > > > > >> >>>>
>> > >> > > > > > > >> >>>>
>> > >> > > > > > > >> >>>> best
>> > >> > > > > > > >> >>>> bai jj
>> > >> > > > > > > >> >>>>
>> > >> > > > > > > >> >>>>
>> > >> > > > > > > >> >>>> Yang Wang <[hidden email]> 于2019年12月18日周三
>> > >> 下午7:21写道:
>> > >> > > > > > > >> >>>>
>> > >> > > > > > > >> >>>>> I just want to revive this discussion.
>> > >> > > > > > > >> >>>>>
>> > >> > > > > > > >> >>>>> Recently, i am thinking about how to natively run
>> > >> flink
>> > >> > > > > per-job
>> > >> > > > > > > >> >>> cluster on
>> > >> > > > > > > >> >>>>> Kubernetes.
>> > >> > > > > > > >> >>>>> The per-job mode on Kubernetes is very different
>> > >> from on
>> > >> > > > Yarn.
>> > >> > > > > > And
>> > >> > > > > > > >> we
>> > >> > > > > > > >> >>> will
>> > >> > > > > > > >> >>>>> have
>> > >> > > > > > > >> >>>>> the same deployment requirements to the client and
>> > >> entry
>> > >> > > > > point.
>> > >> > > > > > > >> >>>>>
>> > >> > > > > > > >> >>>>> 1. Flink client not always need a local jar to start
>> > >> a
>> > >> > > Flink
>> > >> > > > > > > per-job
>> > >> > > > > > > >> >>>>> cluster. We could
>> > >> > > > > > > >> >>>>> support multiple schemas. For example,
>> > >> > > > file:///path/of/my.jar
>> > >> > > > > > > means
>> > >> > > > > > > >> a
>> > >> > > > > > > >> >>> jar
>> > >> > > > > > > >> >>>>> located
>> > >> > > > > > > >> >>>>> at client side,
>> > >> hdfs://myhdfs/user/myname/flink/my.jar
>> > >> > > > means a
>> > >> > > > > > jar
>> > >> > > > > > > >> >>> located
>> > >> > > > > > > >> >>>>> at
>> > >> > > > > > > >> >>>>> remote hdfs, local:///path/in/image/my.jar means a
>> > >> jar
>> > >> > > > located
>> > >> > > > > > at
>> > >> > > > > > > >> >>>>> jobmanager side.
>> > >> > > > > > > >> >>>>>
>> > >> > > > > > > >> >>>>> 2. Support running user program on master side. This
>> > >> also
>> > >> > > > > means
>> > >> > > > > > > the
>> > >> > > > > > > >> >>> entry
>> > >> > > > > > > >> >>>>> point
>> > >> > > > > > > >> >>>>> will generate the job graph on master side. We could
>> > >> use
>> > >> > > the
>> > >> > > > > > > >> >>>>> ClasspathJobGraphRetriever
>> > >> > > > > > > >> >>>>> or start a local Flink client to achieve this
>> > >> purpose.
>> > >> > > > > > > >> >>>>>
>> > >> > > > > > > >> >>>>>
>> > >> > > > > > > >> >>>>> cc tison, Aljoscha & Kostas Do you think this is the
>> > >> right
>> > >> > > > > > > >> direction we
>> > >> > > > > > > >> >>>>> need to work?
>> > >> > > > > > > >> >>>>>
>> > >> > > > > > > >> >>>>> tison <[hidden email]> 于2019年12月12日周四
>> > >> 下午4:48写道:
>> > >> > > > > > > >> >>>>>
>> > >> > > > > > > >> >>>>>> A quick idea is that we separate the deployment
>> > >> from user
>> > >> > > > > > program
>> > >> > > > > > > >> >>> that
>> > >> > > > > > > >> >>>>> it
>> > >> > > > > > > >> >>>>>> has always been done
>> > >> > > > > > > >> >>>>>> outside the program. On user program executed there
>> > >> is
>> > >> > > > > always a
>> > >> > > > > > > >> >>>>>> ClusterClient that communicates with
>> > >> > > > > > > >> >>>>>> an existing cluster, remote or local. It will be
>> > >> another
>> > >> > > > > thread
>> > >> > > > > > > so
>> > >> > > > > > > >> >>> just
>> > >> > > > > > > >> >>>>> for
>> > >> > > > > > > >> >>>>>> your information.
>> > >> > > > > > > >> >>>>>>
>> > >> > > > > > > >> >>>>>> Best,
>> > >> > > > > > > >> >>>>>> tison.
>> > >> > > > > > > >> >>>>>>
>> > >> > > > > > > >> >>>>>>
>> > >> > > > > > > >> >>>>>> tison <[hidden email]> 于2019年12月12日周四
>> > >> 下午4:40写道:
>> > >> > > > > > > >> >>>>>>
>> > >> > > > > > > >> >>>>>>> Hi Peter,
>> > >> > > > > > > >> >>>>>>>
>> > >> > > > > > > >> >>>>>>> Another concern I realized recently is that with
>> > >> current
>> > >> > > > > > > Executors
>> > >> > > > > > > >> >>>>>>> abstraction(FLIP-73)
>> > >> > > > > > > >> >>>>>>> I'm afraid that user program is designed to ALWAYS
>> > >> run
>> > >> > > on
>> > >> > > > > the
>> > >> > > > > > > >> >>> client
>> > >> > > > > > > >> >>>>>> side.
>> > >> > > > > > > >> >>>>>>> Specifically,
>> > >> > > > > > > >> >>>>>>> we deploy the job in executor when env.execute
>> > >> called.
>> > >> > > > This
>> > >> > > > > > > >> >>>>> abstraction
>> > >> > > > > > > >> >>>>>>> possibly prevents
>> > >> > > > > > > >> >>>>>>> Flink runs user program on the cluster side.
>> > >> > > > > > > >> >>>>>>>
>> > >> > > > > > > >> >>>>>>> For your proposal, in this case we already
>> > >> compiled the
>> > >> > > > > > program
>> > >> > > > > > > >> and
>> > >> > > > > > > >> >>>>> run
>> > >> > > > > > > >> >>>>>> on
>> > >> > > > > > > >> >>>>>>> the client side,
>> > >> > > > > > > >> >>>>>>> even we deploy a cluster and retrieve job graph
>> > >> from
>> > >> > > > program
>> > >> > > > > > > >> >>>>> metadata, it
>> > >> > > > > > > >> >>>>>>> doesn't make
>> > >> > > > > > > >> >>>>>>> many sense.
>> > >> > > > > > > >> >>>>>>>
>> > >> > > > > > > >> >>>>>>> cc Aljoscha & Kostas what do you think about this
>> > >> > > > > constraint?
>> > >> > > > > > > >> >>>>>>>
>> > >> > > > > > > >> >>>>>>> Best,
>> > >> > > > > > > >> >>>>>>> tison.
>> > >> > > > > > > >> >>>>>>>
>> > >> > > > > > > >> >>>>>>>
>> > >> > > > > > > >> >>>>>>> Peter Huang <[hidden email]>
>> > >> 于2019年12月10日周二
>> > >> > > > > > > >> 下午12:45写道:
>> > >> > > > > > > >> >>>>>>>
>> > >> > > > > > > >> >>>>>>>> Hi Tison,
>> > >> > > > > > > >> >>>>>>>>
>> > >> > > > > > > >> >>>>>>>> Yes, you are right. I think I made the wrong
>> > >> argument
>> > >> > > in
>> > >> > > > > the
>> > >> > > > > > > doc.
>> > >> > > > > > > >> >>>>>>>> Basically, the packaging jar problem is only for
>> > >> > > platform
>> > >> > > > > > > users.
>> > >> > > > > > > >> >>> In
>> > >> > > > > > > >> >>>>> our
>> > >> > > > > > > >> >>>>>>>> internal deploy service,
>> > >> > > > > > > >> >>>>>>>> we further optimized the deployment latency by
>> > >> letting
>> > >> > > > > users
>> > >> > > > > > to
>> > >> > > > > > > >> >>>>>> packaging
>> > >> > > > > > > >> >>>>>>>> flink-runtime together with the uber jar, so that
>> > >> we
>> > >> > > > don't
>> > >> > > > > > need
>> > >> > > > > > > >> to
>> > >> > > > > > > >> >>>>>>>> consider
>> > >> > > > > > > >> >>>>>>>> multiple flink version
>> > >> > > > > > > >> >>>>>>>> support for now. In the session client mode, as
>> > >> Flink
>> > >> > > > libs
>> > >> > > > > > will
>> > >> > > > > > > >> be
>> > >> > > > > > > >> >>>>>> shipped
>> > >> > > > > > > >> >>>>>>>> anyway as local resources of yarn. Users actually
>> > >> don't
>> > >> > > > > need
>> > >> > > > > > to
>> > >> > > > > > > >> >>>>> package
>> > >> > > > > > > >> >>>>>>>> those libs into job jar.
>> > >> > > > > > > >> >>>>>>>>
>> > >> > > > > > > >> >>>>>>>>
>> > >> > > > > > > >> >>>>>>>>
>> > >> > > > > > > >> >>>>>>>> Best Regards
>> > >> > > > > > > >> >>>>>>>> Peter Huang
>> > >> > > > > > > >> >>>>>>>>
>> > >> > > > > > > >> >>>>>>>> On Mon, Dec 9, 2019 at 8:35 PM tison <
>> > >> > > > [hidden email]
>> > >> > > > > >
>> > >> > > > > > > >> >>> wrote:
>> > >> > > > > > > >> >>>>>>>>
>> > >> > > > > > > >> >>>>>>>>>> 3. What do you mean about the package? Do users
>> > >> need
>> > >> > > to
>> > >> > > > > > > >> >>> compile
>> > >> > > > > > > >> >>>>>> their
>> > >> > > > > > > >> >>>>>>>>> jars
>> > >> > > > > > > >> >>>>>>>>> inlcuding flink-clients, flink-optimizer,
>> > >> flink-table
>> > >> > > > > codes?
>> > >> > > > > > > >> >>>>>>>>>
>> > >> > > > > > > >> >>>>>>>>> The answer should be no because they exist in
>> > >> system
>> > >> > > > > > > classpath.
>> > >> > > > > > > >> >>>>>>>>>
>> > >> > > > > > > >> >>>>>>>>> Best,
>> > >> > > > > > > >> >>>>>>>>> tison.
>> > >> > > > > > > >> >>>>>>>>>
>> > >> > > > > > > >> >>>>>>>>>
>> > >> > > > > > > >> >>>>>>>>> Yang Wang <[hidden email]> 于2019年12月10日周二
>> > >> > > > > 下午12:18写道:
>> > >> > > > > > > >> >>>>>>>>>
>> > >> > > > > > > >> >>>>>>>>>> Hi Peter,
>> > >> > > > > > > >> >>>>>>>>>>
>> > >> > > > > > > >> >>>>>>>>>> Thanks a lot for starting this discussion. I
>> > >> think
>> > >> > > this
>> > >> > > > > is
>> > >> > > > > > a
>> > >> > > > > > > >> >>> very
>> > >> > > > > > > >> >>>>>>>> useful
>> > >> > > > > > > >> >>>>>>>>>> feature.
>> > >> > > > > > > >> >>>>>>>>>>
>> > >> > > > > > > >> >>>>>>>>>> Not only for Yarn, i am focused on flink on
>> > >> > > Kubernetes
>> > >> > > > > > > >> >>>>> integration
>> > >> > > > > > > >> >>>>>> and
>> > >> > > > > > > >> >>>>>>>>> come
>> > >> > > > > > > >> >>>>>>>>>> across the same
>> > >> > > > > > > >> >>>>>>>>>> problem. I do not want the job graph generated
>> > >> on
>> > >> > > > client
>> > >> > > > > > > side.
>> > >> > > > > > > >> >>>>>>>> Instead,
>> > >> > > > > > > >> >>>>>>>>> the
>> > >> > > > > > > >> >>>>>>>>>> user jars are built in
>> > >> > > > > > > >> >>>>>>>>>> a user-defined image. When the job manager
>> > >> launched,
>> > >> > > we
>> > >> > > > > > just
>> > >> > > > > > > >> >>>>> need to
>> > >> > > > > > > >> >>>>>>>>>> generate the job graph
>> > >> > > > > > > >> >>>>>>>>>> based on local user jars.
>> > >> > > > > > > >> >>>>>>>>>>
>> > >> > > > > > > >> >>>>>>>>>> I have some small suggestion about this.
>> > >> > > > > > > >> >>>>>>>>>>
>> > >> > > > > > > >> >>>>>>>>>> 1. `ProgramJobGraphRetriever` is very similar to
>> > >> > > > > > > >> >>>>>>>>>> `ClasspathJobGraphRetriever`, the differences
>> > >> > > > > > > >> >>>>>>>>>> are the former needs `ProgramMetadata` and the
>> > >> latter
>> > >> > > > > needs
>> > >> > > > > > > >> >>> some
>> > >> > > > > > > >> >>>>>>>>> arguments.
>> > >> > > > > > > >> >>>>>>>>>> Is it possible to
>> > >> > > > > > > >> >>>>>>>>>> have an unified `JobGraphRetriever` to support
>> > >> both?
>> > >> > > > > > > >> >>>>>>>>>> 2. Is it possible to not use a local user jar to
>> > >> > > start
>> > >> > > > a
>> > >> > > > > > > >> >>> per-job
>> > >> > > > > > > >> >>>>>>>> cluster?
>> > >> > > > > > > >> >>>>>>>>>> In your case, the user jars has
>> > >> > > > > > > >> >>>>>>>>>> existed on hdfs already and we do need to
>> > >> download
>> > >> > > the
>> > >> > > > > jars
>> > >> > > > > > > to
>> > >> > > > > > > >> >>>>>>>> deployer
>> > >> > > > > > > >> >>>>>>>>>> service. Currently, we
>> > >> > > > > > > >> >>>>>>>>>> always need a local user jar to start a flink
>> > >> > > cluster.
>> > >> > > > It
>> > >> > > > > > is
>> > >> > > > > > > >> >>> be
>> > >> > > > > > > >> >>>>>> great
>> > >> > > > > > > >> >>>>>>>> if
>> > >> > > > > > > >> >>>>>>>>> we
>> > >> > > > > > > >> >>>>>>>>>> could support remote user jars.
>> > >> > > > > > > >> >>>>>>>>>>>> In the implementation, we assume users package
>> > >> > > > > > > >> >>> flink-clients,
>> > >> > > > > > > >> >>>>>>>>>> flink-optimizer, flink-table together within
>> > >> the job
>> > >> > > > jar.
>> > >> > > > > > > >> >>>>> Otherwise,
>> > >> > > > > > > >> >>>>>>>> the
>> > >> > > > > > > >> >>>>>>>>>> job graph generation within
>> > >> JobClusterEntryPoint will
>> > >> > > > > fail.
>> > >> > > > > > > >> >>>>>>>>>> 3. What do you mean about the package? Do users
>> > >> need
>> > >> > > to
>> > >> > > > > > > >> >>> compile
>> > >> > > > > > > >> >>>>>> their
>> > >> > > > > > > >> >>>>>>>>> jars
>> > >> > > > > > > >> >>>>>>>>>> inlcuding flink-clients, flink-optimizer,
>> > >> flink-table
>> > >> > > > > > codes?
>> > >> > > > > > > >> >>>>>>>>>>
>> > >> > > > > > > >> >>>>>>>>>>
>> > >> > > > > > > >> >>>>>>>>>>
>> > >> > > > > > > >> >>>>>>>>>> Best,
>> > >> > > > > > > >> >>>>>>>>>> Yang
>> > >> > > > > > > >> >>>>>>>>>>
>> > >> > > > > > > >> >>>>>>>>>> Peter Huang <[hidden email]>
>> > >> > > > 于2019年12月10日周二
>> > >> > > > > > > >> >>>>> 上午2:37写道:
>> > >> > > > > > > >> >>>>>>>>>>
>> > >> > > > > > > >> >>>>>>>>>>> Dear All,
>> > >> > > > > > > >> >>>>>>>>>>>
>> > >> > > > > > > >> >>>>>>>>>>> Recently, the Flink community starts to
>> > >> improve the
>> > >> > > > yarn
>> > >> > > > > > > >> >>>>> cluster
>> > >> > > > > > > >> >>>>>>>>>> descriptor
>> > >> > > > > > > >> >>>>>>>>>>> to make job jar and config files configurable
>> > >> from
>> > >> > > > CLI.
>> > >> > > > > It
>> > >> > > > > > > >> >>>>>> improves
>> > >> > > > > > > >> >>>>>>>> the
>> > >> > > > > > > >> >>>>>>>>>>> flexibility of  Flink deployment Yarn Per Job
>> > >> Mode.
>> > >> > > > For
>> > >> > > > > > > >> >>>>> platform
>> > >> > > > > > > >> >>>>>>>> users
>> > >> > > > > > > >> >>>>>>>>>> who
>> > >> > > > > > > >> >>>>>>>>>>> manage tens of hundreds of streaming pipelines
>> > >> for
>> > >> > > the
>> > >> > > > > > whole
>> > >> > > > > > > >> >>>>> org
>> > >> > > > > > > >> >>>>>> or
>> > >> > > > > > > >> >>>>>>>>>>> company, we found the job graph generation in
>> > >> > > > > client-side
>> > >> > > > > > is
>> > >> > > > > > > >> >>>>>> another
>> > >> > > > > > > >> >>>>>>>>>>> pinpoint. Thus, we want to propose a
>> > >> configurable
>> > >> > > > > feature
>> > >> > > > > > > >> >>> for
>> > >> > > > > > > >> >>>>>>>>>>> FlinkYarnSessionCli. The feature can allow
>> > >> users to
>> > >> > > > > choose
>> > >> > > > > > > >> >>> the
>> > >> > > > > > > >> >>>>> job
>> > >> > > > > > > >> >>>>>>>>> graph
>> > >> > > > > > > >> >>>>>>>>>>> generation in Flink ClusterEntryPoint so that
>> > >> the
>> > >> > > job
>> > >> > > > > jar
>> > >> > > > > > > >> >>>>> doesn't
>> > >> > > > > > > >> >>>>>>>> need
>> > >> > > > > > > >> >>>>>>>>> to
>> > >> > > > > > > >> >>>>>>>>>>> be locally for the job graph generation. The
>> > >> > > proposal
>> > >> > > > is
>> > >> > > > > > > >> >>>>> organized
>> > >> > > > > > > >> >>>>>>>> as a
>> > >> > > > > > > >> >>>>>>>>>>> FLIP
>> > >> > > > > > > >> >>>>>>>>>>>
>> > >> > > > > > > >> >>>>>>>>>>>
>> > >> > > > > > > >> >>>>>>>>>>
>> > >> > > > > > > >> >>>>>>>>>
>> > >> > > > > > > >> >>>>>>>>
>> > >> > > > > > > >> >>>>>>
>> > >> > > > > > > >> >>>>>
>> > >> > > > > > > >> >>>
>> > >> > > > > > > >>
>> > >> > > > > > >
>> > >> > > > > >
>> > >> > > > >
>> > >> > > >
>> > >> > >
>> > >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-85+Delayed+JobGraph+Generation
>> > >> > > > > > > >> >>>>>>>>>>> .
>> > >> > > > > > > >> >>>>>>>>>>>
>> > >> > > > > > > >> >>>>>>>>>>> Any questions and suggestions are welcomed.
>> > >> Thank
>> > >> > > you
>> > >> > > > in
>> > >> > > > > > > >> >>>>> advance.
>> > >> > > > > > > >> >>>>>>>>>>>
>> > >> > > > > > > >> >>>>>>>>>>>
>> > >> > > > > > > >> >>>>>>>>>>> Best Regards
>> > >> > > > > > > >> >>>>>>>>>>> Peter Huang
>> > >> > > > > > > >> >>>>>>>>>>>
>> > >> > > > > > > >> >>>>>>>>>>
>> > >> > > > > > > >> >>>>>>>>>
>> > >> > > > > > > >> >>>>>>>>
>> > >> > > > > > > >> >>>>>>>
>> > >> > > > > > > >> >>>>>>
>> > >> > > > > > > >> >>>>>
>> > >> > > > > > > >> >>>>
>> > >> > > > > > > >> >>>
>> > >> > > > > > > >> >>
>> > >> > > > > > > >>
>> > >> > > > > > > >>
>> > >> > > > > > >
>> > >> > > > > >
>> > >> > > > >
>> > >> > > >
>> > >> > >
>> > >>
>> > >
>>

Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-85: Delayed Job Graph Generation

Peter Huang
Hi Kostas,

Thanks for updating the wiki. We have aligned with the implementations in
the doc. But I feel it is still a little bit confusing of the naming from a
user's perspective. It is well known that Flink support per job cluster and
session cluster. The concept is in the layer of how a job is managed within
Flink. The method introduced util now is a kind of mixing job and session
cluster to promising the implementation complexity. We probably don't need
to label it as Application Model as the same layer of per job cluster and
session cluster. Conceptually, I think it is still a cluster mode
implementation for per job cluster.

To minimize the confusion of users, I think it would be better just an
option of per job cluster for each type of cluster manager. How do you
think?


Best Regards
Peter Huang








On Mon, Mar 2, 2020 at 7:22 AM Kostas Kloudas <[hidden email]> wrote:

> Hi Yang,
>
> The difference between per-job and application mode is that, as you
> described, in the per-job mode the main is executed on the client
> while in the application mode, the main is executed on the cluster.
> I do not think we have to offer "application mode" with running the
> main on the client side as this is exactly what the per-job mode does
> currently and, as you described also, it would be redundant.
>
> Sorry if this was not clear in the document.
>
> Cheers,
> Kostas
>
> On Mon, Mar 2, 2020 at 3:17 PM Yang Wang <[hidden email]> wrote:
> >
> > Hi Kostas,
> >
> > Thanks a lot for your conclusion and updating the FLIP-85 WIKI.
> Currently, i have no more
> > questions about motivation, approach, fault tolerance and the first
> phase implementation.
> >
> > I think the new title "Flink Application Mode" makes a lot senses to me.
> Especially for the
> > containerized environment, the cluster deploy option will be very useful.
> >
> > Just one concern, how do we introduce this new application mode to our
> users?
> > Each user program(i.e. `main()`) is an application. Currently, we intend
> to only support one
> > `execute()`. So what's the difference between per-job and application
> mode?
> >
> > For per-job, user `main()` is always executed on client side. And For
> application mode, user
> > `main()` could be executed on client or master side(configured via cli
> option).
> > Right? We need to have a clear concept. Otherwise, the users will be
> more and more confusing.
> >
> >
> > Best,
> > Yang
> >
> > Kostas Kloudas <[hidden email]> 于2020年3月2日周一 下午5:58写道:
> >>
> >> Hi all,
> >>
> >> I update
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-85+Flink+Application+Mode
> >> based on the discussion we had here:
> >>
> >>
> https://docs.google.com/document/d/1ji72s3FD9DYUyGuKnJoO4ApzV-nSsZa0-bceGXW7Ocw/edit#
> >>
> >> Please let me know what you think and please keep the discussion in the
> ML :)
> >>
> >> Thanks for starting the discussion and I hope that soon we will be
> >> able to vote on the FLIP.
> >>
> >> Cheers,
> >> Kostas
> >>
> >> On Thu, Jan 16, 2020 at 3:40 AM Yang Wang <[hidden email]>
> wrote:
> >> >
> >> > Hi all,
> >> >
> >> > Thanks a lot for the feedback from @Kostas Kloudas. Your all concerns
> are
> >> > on point. The FLIP-85 is mainly
> >> > focused on supporting cluster mode for per-job. Since it is more
> urgent and
> >> > have much more use
> >> > cases both in Yarn and Kubernetes deployment. For session cluster, we
> could
> >> > have more discussion
> >> > in a new thread later.
> >> >
> >> > #1, How to download the user jars and dependencies for per-job in
> cluster
> >> > mode?
> >> > For Yarn, we could register the user jars and dependencies as
> >> > LocalResource. They will be distributed
> >> > by Yarn. And once the JobManager and TaskManager launched, the jars
> are
> >> > already exists.
> >> > For Standalone per-job and K8s, we expect that the user jars
> >> > and dependencies are built into the image.
> >> > Or the InitContainer could be used for downloading. It is natively
> >> > distributed and we will not have bottleneck.
> >> >
> >> > #2, Job graph recovery
> >> > We could have an optimization to store job graph on the DFS. However,
> i
> >> > suggest building a new jobgraph
> >> > from the configuration is the default option. Since we will not
> always have
> >> > a DFS store when deploying a
> >> > Flink per-job cluster. Of course, we assume that using the same
> >> > configuration(e.g. job_id, user_jar, main_class,
> >> > main_args, parallelism, savepoint_settings, etc.) will get a same job
> >> > graph. I think the standalone per-job
> >> > already has the similar behavior.
> >> >
> >> > #3, What happens with jobs that have multiple execute calls?
> >> > Currently, it is really a problem. Even we use a local client on Flink
> >> > master side, it will have different behavior with
> >> > client mode. For client mode, if we execute multiple times, then we
> will
> >> > deploy multiple Flink clusters for each execute.
> >> > I am not pretty sure whether it is reasonable. However, i still think
> using
> >> > the local client is a good choice. We could
> >> > continue the discussion in a new thread. @Zili Chen <
> [hidden email]> Do
> >> > you want to drive this?
> >> >
> >> >
> >> >
> >> > Best,
> >> > Yang
> >> >
> >> > Peter Huang <[hidden email]> 于2020年1月16日周四 上午1:55写道:
> >> >
> >> > > Hi Kostas,
> >> > >
> >> > > Thanks for this feedback. I can't agree more about the opinion. The
> >> > > cluster mode should be added
> >> > > first in per job cluster.
> >> > >
> >> > > 1) For job cluster implementation
> >> > > 1. Job graph recovery from configuration or store as static job
> graph as
> >> > > session cluster. I think the static one will be better for less
> recovery
> >> > > time.
> >> > > Let me update the doc for details.
> >> > >
> >> > > 2. For job execute multiple times, I think @Zili Chen
> >> > > <[hidden email]> has proposed the local client solution that
> can
> >> > > the run program actually in the cluster entry point. We can put the
> >> > > implementation in the second stage,
> >> > > or even a new FLIP for further discussion.
> >> > >
> >> > > 2) For session cluster implementation
> >> > > We can disable the cluster mode for the session cluster in the first
> >> > > stage. I agree the jar downloading will be a painful thing.
> >> > > We can consider about PoC and performance evaluation first. If the
> end to
> >> > > end experience is good enough, then we can consider
> >> > > proceeding with the solution.
> >> > >
> >> > > Looking forward to more opinions from @Yang Wang <
> [hidden email]> @Zili
> >> > > Chen <[hidden email]> @Dian Fu <[hidden email]>.
> >> > >
> >> > >
> >> > > Best Regards
> >> > > Peter Huang
> >> > >
> >> > > On Wed, Jan 15, 2020 at 7:50 AM Kostas Kloudas <[hidden email]>
> wrote:
> >> > >
> >> > >> Hi all,
> >> > >>
> >> > >> I am writing here as the discussion on the Google Doc seems to be a
> >> > >> bit difficult to follow.
> >> > >>
> >> > >> I think that in order to be able to make progress, it would be
> helpful
> >> > >> to focus on per-job mode for now.
> >> > >> The reason is that:
> >> > >>  1) making the (unique) JobSubmitHandler responsible for creating
> the
> >> > >> jobgraphs,
> >> > >>   which includes downloading dependencies, is not an optimal
> solution
> >> > >>  2) even if we put the responsibility on the JobMaster, currently
> each
> >> > >> job has its own
> >> > >>   JobMaster but they all run on the same process, so we have again
> a
> >> > >> single entity.
> >> > >>
> >> > >> Of course after this is done, and if we feel comfortable with the
> >> > >> solution, then we can go to the session mode.
> >> > >>
> >> > >> A second comment has to do with fault-tolerance in the per-job,
> >> > >> cluster-deploy mode.
> >> > >> In the document, it is suggested that upon recovery, the JobMaster
> of
> >> > >> each job re-creates the JobGraph.
> >> > >> I am just wondering if it is better to create and store the
> jobGraph
> >> > >> upon submission and only fetch it
> >> > >> upon recovery so that we have a static jobGraph.
> >> > >>
> >> > >> Finally, I have a question which is what happens with jobs that
> have
> >> > >> multiple execute calls?
> >> > >> The semantics seem to change compared to the current behaviour,
> right?
> >> > >>
> >> > >> Cheers,
> >> > >> Kostas
> >> > >>
> >> > >> On Wed, Jan 8, 2020 at 8:05 PM tison <[hidden email]> wrote:
> >> > >> >
> >> > >> > not always, Yang Wang is also not yet a committer but he can
> join the
> >> > >> > channel. I cannot find the id by clicking “Add new member in
> channel” so
> >> > >> > come to you and ask for try out the link. Possibly I will find
> other
> >> > >> ways
> >> > >> > but the original purpose is that the slack channel is a public
> area we
> >> > >> > discuss about developing...
> >> > >> > Best,
> >> > >> > tison.
> >> > >> >
> >> > >> >
> >> > >> > Peter Huang <[hidden email]> 于2020年1月9日周四 上午2:44写道:
> >> > >> >
> >> > >> > > Hi Tison,
> >> > >> > >
> >> > >> > > I am not the committer of Flink yet. I think I can't join it
> also.
> >> > >> > >
> >> > >> > >
> >> > >> > > Best Regards
> >> > >> > > Peter Huang
> >> > >> > >
> >> > >> > > On Wed, Jan 8, 2020 at 9:39 AM tison <[hidden email]>
> wrote:
> >> > >> > >
> >> > >> > > > Hi Peter,
> >> > >> > > >
> >> > >> > > > Could you try out this link?
> >> > >> > > https://the-asf.slack.com/messages/CNA3ADZPH
> >> > >> > > >
> >> > >> > > > Best,
> >> > >> > > > tison.
> >> > >> > > >
> >> > >> > > >
> >> > >> > > > Peter Huang <[hidden email]> 于2020年1月9日周四
> 上午1:22写道:
> >> > >> > > >
> >> > >> > > > > Hi Tison,
> >> > >> > > > >
> >> > >> > > > > I can't join the group with shared link. Would you please
> add me
> >> > >> into
> >> > >> > > the
> >> > >> > > > > group? My slack account is huangzhenqiu0825.
> >> > >> > > > > Thank you in advance.
> >> > >> > > > >
> >> > >> > > > >
> >> > >> > > > > Best Regards
> >> > >> > > > > Peter Huang
> >> > >> > > > >
> >> > >> > > > > On Wed, Jan 8, 2020 at 12:02 AM tison <
> [hidden email]>
> >> > >> wrote:
> >> > >> > > > >
> >> > >> > > > > > Hi Peter,
> >> > >> > > > > >
> >> > >> > > > > > As described above, this effort should get attention
> from people
> >> > >> > > > > developing
> >> > >> > > > > > FLIP-73 a.k.a. Executor abstractions. I recommend you to
> join
> >> > >> the
> >> > >> > > > public
> >> > >> > > > > > slack channel[1] for Flink Client API Enhancement and
> you can
> >> > >> try to
> >> > >> > > > > share
> >> > >> > > > > > you detailed thoughts there. It possibly gets more
> concrete
> >> > >> > > attentions.
> >> > >> > > > > >
> >> > >> > > > > > Best,
> >> > >> > > > > > tison.
> >> > >> > > > > >
> >> > >> > > > > > [1]
> >> > >> > > > > >
> >> > >> > > > > >
> >> > >> > > > >
> >> > >> > > >
> >> > >> > >
> >> > >>
> https://slack.com/share/IS21SJ75H/Rk8HhUly9FuEHb7oGwBZ33uL/enQtODg2MDYwNjE5MTg3LTA2MjIzNDc1M2ZjZDVlMjdlZjk1M2RkYmJhNjAwMTk2ZDZkODQ4NmY5YmI4OGRhNWJkYTViMTM1NzlmMzc4OWM
> >> > >> > > > > >
> >> > >> > > > > >
> >> > >> > > > > > Peter Huang <[hidden email]> 于2020年1月7日周二
> 上午5:09写道:
> >> > >> > > > > >
> >> > >> > > > > > > Dear All,
> >> > >> > > > > > >
> >> > >> > > > > > > Happy new year! According to existing feedback from the
> >> > >> community,
> >> > >> > > we
> >> > >> > > > > > > revised the doc with the consideration of session
> cluster
> >> > >> support,
> >> > >> > > > and
> >> > >> > > > > > > concrete interface changes needed and execution plan.
> Please
> >> > >> take
> >> > >> > > one
> >> > >> > > > > > more
> >> > >> > > > > > > round of review at your most convenient time.
> >> > >> > > > > > >
> >> > >> > > > > > >
> >> > >> > > > > > >
> >> > >> > > > > >
> >> > >> > > > >
> >> > >> > > >
> >> > >> > >
> >> > >>
> https://docs.google.com/document/d/1aAwVjdZByA-0CHbgv16Me-vjaaDMCfhX7TzVVTuifYM/edit#
> >> > >> > > > > > >
> >> > >> > > > > > >
> >> > >> > > > > > > Best Regards
> >> > >> > > > > > > Peter Huang
> >> > >> > > > > > >
> >> > >> > > > > > >
> >> > >> > > > > > >
> >> > >> > > > > > >
> >> > >> > > > > > >
> >> > >> > > > > > > On Thu, Jan 2, 2020 at 11:29 AM Peter Huang <
> >> > >> > > > > [hidden email]>
> >> > >> > > > > > > wrote:
> >> > >> > > > > > >
> >> > >> > > > > > > > Hi Dian,
> >> > >> > > > > > > > Thanks for giving us valuable feedbacks.
> >> > >> > > > > > > >
> >> > >> > > > > > > > 1) It's better to have a whole design for this
> feature
> >> > >> > > > > > > > For the suggestion of enabling the cluster mode also
> session
> >> > >> > > > > cluster, I
> >> > >> > > > > > > > think Flink already supported it.
> WebSubmissionExtension
> >> > >> already
> >> > >> > > > > allows
> >> > >> > > > > > > > users to start a job with the specified jar by using
> web UI.
> >> > >> > > > > > > > But we need to enable the feature from CLI for both
> local
> >> > >> jar,
> >> > >> > > > remote
> >> > >> > > > > > > jar.
> >> > >> > > > > > > > I will align with Yang Wang first about the details
> and
> >> > >> update
> >> > >> > > the
> >> > >> > > > > > design
> >> > >> > > > > > > > doc.
> >> > >> > > > > > > >
> >> > >> > > > > > > > 2) It's better to consider the convenience for
> users, such
> >> > >> as
> >> > >> > > > > debugging
> >> > >> > > > > > > >
> >> > >> > > > > > > > I am wondering whether we can store the exception in
> >> > >> jobgragh
> >> > >> > > > > > > > generation in application master. As no streaming
> graph can
> >> > >> be
> >> > >> > > > > > scheduled
> >> > >> > > > > > > in
> >> > >> > > > > > > > this case, there will be no more TM will be
> requested from
> >> > >> > > FlinkRM.
> >> > >> > > > > > > > If the AM is still running, users can still query it
> from
> >> > >> CLI. As
> >> > >> > > > it
> >> > >> > > > > > > > requires more change, we can get some feedback from <
> >> > >> > > > > > [hidden email]
> >> > >> > > > > > > >
> >> > >> > > > > > > > and @[hidden email] <[hidden email]>.
> >> > >> > > > > > > >
> >> > >> > > > > > > > 3) It's better to consider the impact to the
> stability of
> >> > >> the
> >> > >> > > > cluster
> >> > >> > > > > > > >
> >> > >> > > > > > > > I agree with Yang Wang's opinion.
> >> > >> > > > > > > >
> >> > >> > > > > > > >
> >> > >> > > > > > > >
> >> > >> > > > > > > > Best Regards
> >> > >> > > > > > > > Peter Huang
> >> > >> > > > > > > >
> >> > >> > > > > > > >
> >> > >> > > > > > > > On Sun, Dec 29, 2019 at 9:44 PM Dian Fu <
> >> > >> [hidden email]>
> >> > >> > > > > wrote:
> >> > >> > > > > > > >
> >> > >> > > > > > > >> Hi all,
> >> > >> > > > > > > >>
> >> > >> > > > > > > >> Sorry to jump into this discussion. Thanks everyone
> for the
> >> > >> > > > > > discussion.
> >> > >> > > > > > > >> I'm very interested in this topic although I'm not
> an
> >> > >> expert in
> >> > >> > > > this
> >> > >> > > > > > > part.
> >> > >> > > > > > > >> So I'm glad to share my thoughts as following:
> >> > >> > > > > > > >>
> >> > >> > > > > > > >> 1) It's better to have a whole design for this
> feature
> >> > >> > > > > > > >> As we know, there are two deployment modes: per-job
> mode
> >> > >> and
> >> > >> > > > session
> >> > >> > > > > > > >> mode. I'm wondering which mode really needs this
> feature.
> >> > >> As the
> >> > >> > > > > > design
> >> > >> > > > > > > doc
> >> > >> > > > > > > >> mentioned, per-job mode is more used for streaming
> jobs and
> >> > >> > > > session
> >> > >> > > > > > > mode is
> >> > >> > > > > > > >> usually used for batch jobs(Of course, the job
> types and
> >> > >> the
> >> > >> > > > > > deployment
> >> > >> > > > > > > >> modes are orthogonal). Usually streaming job is only
> >> > >> needed to
> >> > >> > > be
> >> > >> > > > > > > submitted
> >> > >> > > > > > > >> once and it will run for days or weeks, while batch
> jobs
> >> > >> will be
> >> > >> > > > > > > submitted
> >> > >> > > > > > > >> more frequently compared with streaming jobs. This
> means
> >> > >> that
> >> > >> > > > maybe
> >> > >> > > > > > > session
> >> > >> > > > > > > >> mode also needs this feature. However, if we
> support this
> >> > >> > > feature
> >> > >> > > > in
> >> > >> > > > > > > >> session mode, the application master will become
> the new
> >> > >> > > > centralized
> >> > >> > > > > > > >> service(which should be solved). So in this case,
> it's
> >> > >> better to
> >> > >> > > > > have
> >> > >> > > > > > a
> >> > >> > > > > > > >> complete design for both per-job mode and session
> mode.
> >> > >> > > > Furthermore,
> >> > >> > > > > > > even
> >> > >> > > > > > > >> if we can do it phase by phase, we need to have a
> whole
> >> > >> picture
> >> > >> > > of
> >> > >> > > > > how
> >> > >> > > > > > > it
> >> > >> > > > > > > >> works in both per-job mode and session mode.
> >> > >> > > > > > > >>
> >> > >> > > > > > > >> 2) It's better to consider the convenience for
> users, such
> >> > >> as
> >> > >> > > > > > debugging
> >> > >> > > > > > > >> After we finish this feature, the job graph will be
> >> > >> compiled in
> >> > >> > > > the
> >> > >> > > > > > > >> application master, which means that users cannot
> easily
> >> > >> get the
> >> > >> > > > > > > exception
> >> > >> > > > > > > >> message synchorousely in the job client if there are
> >> > >> problems
> >> > >> > > > during
> >> > >> > > > > > the
> >> > >> > > > > > > >> job graph compiling (especially for platform
> users), such
> >> > >> as the
> >> > >> > > > > > > resource
> >> > >> > > > > > > >> path is incorrect, the user program itself has some
> >> > >> problems,
> >> > >> > > etc.
> >> > >> > > > > > What
> >> > >> > > > > > > I'm
> >> > >> > > > > > > >> thinking is that maybe we should throw the
> exceptions as
> >> > >> early
> >> > >> > > as
> >> > >> > > > > > > possible
> >> > >> > > > > > > >> (during job submission stage).
> >> > >> > > > > > > >>
> >> > >> > > > > > > >> 3) It's better to consider the impact to the
> stability of
> >> > >> the
> >> > >> > > > > cluster
> >> > >> > > > > > > >> If we perform the compiling in the application
> master, we
> >> > >> should
> >> > >> > > > > > > consider
> >> > >> > > > > > > >> the impact of the compiling errors. Although YARN
> could
> >> > >> resume
> >> > >> > > the
> >> > >> > > > > > > >> application master in case of failures, but in some
> case
> >> > >> the
> >> > >> > > > > compiling
> >> > >> > > > > > > >> failure may be a waste of cluster resource and may
> impact
> >> > >> the
> >> > >> > > > > > stability
> >> > >> > > > > > > the
> >> > >> > > > > > > >> cluster and the other jobs in the cluster, such as
> the
> >> > >> resource
> >> > >> > > > path
> >> > >> > > > > > is
> >> > >> > > > > > > >> incorrect, the user program itself has some
> problems(in
> >> > >> this
> >> > >> > > case,
> >> > >> > > > > job
> >> > >> > > > > > > >> failover cannot solve this kind of problems) etc.
> In the
> >> > >> current
> >> > >> > > > > > > >> implemention, the compiling errors are handled in
> the
> >> > >> client
> >> > >> > > side
> >> > >> > > > > and
> >> > >> > > > > > > there
> >> > >> > > > > > > >> is no impact to the cluster at all.
> >> > >> > > > > > > >>
> >> > >> > > > > > > >> Regarding to 1), it's clearly pointed in the design
> doc
> >> > >> that
> >> > >> > > only
> >> > >> > > > > > > per-job
> >> > >> > > > > > > >> mode will be supported. However, I think it's
> better to
> >> > >> also
> >> > >> > > > > consider
> >> > >> > > > > > > the
> >> > >> > > > > > > >> session mode in the design doc.
> >> > >> > > > > > > >> Regarding to 2) and 3), I have not seen related
> sections
> >> > >> in the
> >> > >> > > > > design
> >> > >> > > > > > > >> doc. It will be good if we can cover them in the
> design
> >> > >> doc.
> >> > >> > > > > > > >>
> >> > >> > > > > > > >> Feel free to correct me If there is anything I
> >> > >> misunderstand.
> >> > >> > > > > > > >>
> >> > >> > > > > > > >> Regards,
> >> > >> > > > > > > >> Dian
> >> > >> > > > > > > >>
> >> > >> > > > > > > >>
> >> > >> > > > > > > >> > 在 2019年12月27日,上午3:13,Peter Huang <
> >> > >> [hidden email]>
> >> > >> > > > 写道:
> >> > >> > > > > > > >> >
> >> > >> > > > > > > >> > Hi Yang,
> >> > >> > > > > > > >> >
> >> > >> > > > > > > >> > I can't agree more. The effort definitely needs
> to align
> >> > >> with
> >> > >> > > > the
> >> > >> > > > > > > final
> >> > >> > > > > > > >> > goal of FLIP-73.
> >> > >> > > > > > > >> > I am thinking about whether we can achieve the
> goal with
> >> > >> two
> >> > >> > > > > phases.
> >> > >> > > > > > > >> >
> >> > >> > > > > > > >> > 1) Phase I
> >> > >> > > > > > > >> > As the CLiFrontend will not be depreciated soon.
> We can
> >> > >> still
> >> > >> > > > use
> >> > >> > > > > > the
> >> > >> > > > > > > >> > deployMode flag there,
> >> > >> > > > > > > >> > pass the program info through Flink
> configuration,  use
> >> > >> the
> >> > >> > > > > > > >> > ClassPathJobGraphRetriever
> >> > >> > > > > > > >> > to generate the job graph in ClusterEntrypoints
> of yarn
> >> > >> and
> >> > >> > > > > > > Kubernetes.
> >> > >> > > > > > > >> >
> >> > >> > > > > > > >> > 2) Phase II
> >> > >> > > > > > > >> > In  AbstractJobClusterExecutor, the job graph is
> >> > >> generated in
> >> > >> > > > the
> >> > >> > > > > > > >> execute
> >> > >> > > > > > > >> > function. We can still
> >> > >> > > > > > > >> > use the deployMode in it. With deployMode =
> cluster, the
> >> > >> > > execute
> >> > >> > > > > > > >> function
> >> > >> > > > > > > >> > only starts the cluster.
> >> > >> > > > > > > >> >
> >> > >> > > > > > > >> > When {Yarn/Kuberneates}PerJobClusterEntrypoint
> starts,
> >> > >> It will
> >> > >> > > > > start
> >> > >> > > > > > > the
> >> > >> > > > > > > >> > dispatch first, then we can use
> >> > >> > > > > > > >> > a ClusterEnvironment similar to
> ContextEnvironment to
> >> > >> submit
> >> > >> > > the
> >> > >> > > > > job
> >> > >> > > > > > > >> with
> >> > >> > > > > > > >> > jobName the local
> >> > >> > > > > > > >> > dispatcher. For the details, we need more
> investigation.
> >> > >> Let's
> >> > >> > > > > wait
> >> > >> > > > > > > >> > for @Aljoscha
> >> > >> > > > > > > >> > Krettek <[hidden email]> @Till Rohrmann <
> >> > >> > > > > [hidden email]
> >> > >> > > > > > >'s
> >> > >> > > > > > > >> > feedback after the holiday season.
> >> > >> > > > > > > >> >
> >> > >> > > > > > > >> > Thank you in advance. Merry Chrismas and Happy New
> >> > >> Year!!!
> >> > >> > > > > > > >> >
> >> > >> > > > > > > >> >
> >> > >> > > > > > > >> > Best Regards
> >> > >> > > > > > > >> > Peter Huang
> >> > >> > > > > > > >> >
> >> > >> > > > > > > >> >
> >> > >> > > > > > > >> >
> >> > >> > > > > > > >> >
> >> > >> > > > > > > >> >
> >> > >> > > > > > > >> >
> >> > >> > > > > > > >> >
> >> > >> > > > > > > >> >
> >> > >> > > > > > > >> > On Wed, Dec 25, 2019 at 1:08 AM Yang Wang <
> >> > >> > > > [hidden email]>
> >> > >> > > > > > > >> wrote:
> >> > >> > > > > > > >> >
> >> > >> > > > > > > >> >> Hi Peter,
> >> > >> > > > > > > >> >>
> >> > >> > > > > > > >> >> I think we need to reconsider tison's suggestion
> >> > >> seriously.
> >> > >> > > > After
> >> > >> > > > > > > >> FLIP-73,
> >> > >> > > > > > > >> >> the deployJobCluster has
> >> > >> > > > > > > >> >> beenmoved into `JobClusterExecutor#execute`. It
> should
> >> > >> not be
> >> > >> > > > > > > perceived
> >> > >> > > > > > > >> >> for `CliFrontend`. That
> >> > >> > > > > > > >> >> means the user program will *ALWAYS* be executed
> on
> >> > >> client
> >> > >> > > > side.
> >> > >> > > > > > This
> >> > >> > > > > > > >> is
> >> > >> > > > > > > >> >> the by design behavior.
> >> > >> > > > > > > >> >> So, we could not just add `if(client mode) ..
> else
> >> > >> if(cluster
> >> > >> > > > > mode)
> >> > >> > > > > > > >> ...`
> >> > >> > > > > > > >> >> codes in `CliFrontend` to bypass
> >> > >> > > > > > > >> >> the executor. We need to find a clean way to
> decouple
> >> > >> > > executing
> >> > >> > > > > > user
> >> > >> > > > > > > >> >> program and deploying per-job
> >> > >> > > > > > > >> >> cluster. Based on this, we could support to
> execute user
> >> > >> > > > program
> >> > >> > > > > on
> >> > >> > > > > > > >> client
> >> > >> > > > > > > >> >> or master side.
> >> > >> > > > > > > >> >>
> >> > >> > > > > > > >> >> Maybe Aljoscha and Jeff could give some good
> >> > >> suggestions.
> >> > >> > > > > > > >> >>
> >> > >> > > > > > > >> >>
> >> > >> > > > > > > >> >>
> >> > >> > > > > > > >> >> Best,
> >> > >> > > > > > > >> >> Yang
> >> > >> > > > > > > >> >>
> >> > >> > > > > > > >> >> Peter Huang <[hidden email]>
> 于2019年12月25日周三
> >> > >> > > > > 上午4:03写道:
> >> > >> > > > > > > >> >>
> >> > >> > > > > > > >> >>> Hi Jingjing,
> >> > >> > > > > > > >> >>>
> >> > >> > > > > > > >> >>> The improvement proposed is a deployment option
> for
> >> > >> CLI. For
> >> > >> > > > SQL
> >> > >> > > > > > > based
> >> > >> > > > > > > >> >>> Flink application, It is more convenient to use
> the
> >> > >> existing
> >> > >> > > > > model
> >> > >> > > > > > > in
> >> > >> > > > > > > >> >>> SqlClient in which
> >> > >> > > > > > > >> >>> the job graph is generated within SqlClient.
> After
> >> > >> adding
> >> > >> > > the
> >> > >> > > > > > > delayed
> >> > >> > > > > > > >> job
> >> > >> > > > > > > >> >>> graph generation, I think there is no change is
> needed
> >> > >> for
> >> > >> > > > your
> >> > >> > > > > > > side.
> >> > >> > > > > > > >> >>>
> >> > >> > > > > > > >> >>>
> >> > >> > > > > > > >> >>> Best Regards
> >> > >> > > > > > > >> >>> Peter Huang
> >> > >> > > > > > > >> >>>
> >> > >> > > > > > > >> >>>
> >> > >> > > > > > > >> >>> On Wed, Dec 18, 2019 at 6:01 AM jingjing bai <
> >> > >> > > > > > > >> [hidden email]>
> >> > >> > > > > > > >> >>> wrote:
> >> > >> > > > > > > >> >>>
> >> > >> > > > > > > >> >>>> hi peter:
> >> > >> > > > > > > >> >>>>    we had extension SqlClent to support sql job
> >> > >> submit in
> >> > >> > > web
> >> > >> > > > > > base
> >> > >> > > > > > > on
> >> > >> > > > > > > >> >>>> flink 1.9.   we support submit to yarn on per
> job
> >> > >> mode too.
> >> > >> > > > > > > >> >>>>    in this case, the job graph generated  on
> client
> >> > >> side
> >> > >> > > .  I
> >> > >> > > > > > think
> >> > >> > > > > > > >> >>> this
> >> > >> > > > > > > >> >>>> discuss Mainly to improve api programme.  but
> in my
> >> > >> case ,
> >> > >> > > > > there
> >> > >> > > > > > is
> >> > >> > > > > > > >> no
> >> > >> > > > > > > >> >>>> jar to upload but only a sql string .
> >> > >> > > > > > > >> >>>>    do u had more suggestion to improve for sql
> mode
> >> > >> or it
> >> > >> > > is
> >> > >> > > > > > only a
> >> > >> > > > > > > >> >>>> switch for api programme?
> >> > >> > > > > > > >> >>>>
> >> > >> > > > > > > >> >>>>
> >> > >> > > > > > > >> >>>> best
> >> > >> > > > > > > >> >>>> bai jj
> >> > >> > > > > > > >> >>>>
> >> > >> > > > > > > >> >>>>
> >> > >> > > > > > > >> >>>> Yang Wang <[hidden email]>
> 于2019年12月18日周三
> >> > >> 下午7:21写道:
> >> > >> > > > > > > >> >>>>
> >> > >> > > > > > > >> >>>>> I just want to revive this discussion.
> >> > >> > > > > > > >> >>>>>
> >> > >> > > > > > > >> >>>>> Recently, i am thinking about how to natively
> run
> >> > >> flink
> >> > >> > > > > per-job
> >> > >> > > > > > > >> >>> cluster on
> >> > >> > > > > > > >> >>>>> Kubernetes.
> >> > >> > > > > > > >> >>>>> The per-job mode on Kubernetes is very
> different
> >> > >> from on
> >> > >> > > > Yarn.
> >> > >> > > > > > And
> >> > >> > > > > > > >> we
> >> > >> > > > > > > >> >>> will
> >> > >> > > > > > > >> >>>>> have
> >> > >> > > > > > > >> >>>>> the same deployment requirements to the
> client and
> >> > >> entry
> >> > >> > > > > point.
> >> > >> > > > > > > >> >>>>>
> >> > >> > > > > > > >> >>>>> 1. Flink client not always need a local jar
> to start
> >> > >> a
> >> > >> > > Flink
> >> > >> > > > > > > per-job
> >> > >> > > > > > > >> >>>>> cluster. We could
> >> > >> > > > > > > >> >>>>> support multiple schemas. For example,
> >> > >> > > > file:///path/of/my.jar
> >> > >> > > > > > > means
> >> > >> > > > > > > >> a
> >> > >> > > > > > > >> >>> jar
> >> > >> > > > > > > >> >>>>> located
> >> > >> > > > > > > >> >>>>> at client side,
> >> > >> hdfs://myhdfs/user/myname/flink/my.jar
> >> > >> > > > means a
> >> > >> > > > > > jar
> >> > >> > > > > > > >> >>> located
> >> > >> > > > > > > >> >>>>> at
> >> > >> > > > > > > >> >>>>> remote hdfs, local:///path/in/image/my.jar
> means a
> >> > >> jar
> >> > >> > > > located
> >> > >> > > > > > at
> >> > >> > > > > > > >> >>>>> jobmanager side.
> >> > >> > > > > > > >> >>>>>
> >> > >> > > > > > > >> >>>>> 2. Support running user program on master
> side. This
> >> > >> also
> >> > >> > > > > means
> >> > >> > > > > > > the
> >> > >> > > > > > > >> >>> entry
> >> > >> > > > > > > >> >>>>> point
> >> > >> > > > > > > >> >>>>> will generate the job graph on master side.
> We could
> >> > >> use
> >> > >> > > the
> >> > >> > > > > > > >> >>>>> ClasspathJobGraphRetriever
> >> > >> > > > > > > >> >>>>> or start a local Flink client to achieve this
> >> > >> purpose.
> >> > >> > > > > > > >> >>>>>
> >> > >> > > > > > > >> >>>>>
> >> > >> > > > > > > >> >>>>> cc tison, Aljoscha & Kostas Do you think this
> is the
> >> > >> right
> >> > >> > > > > > > >> direction we
> >> > >> > > > > > > >> >>>>> need to work?
> >> > >> > > > > > > >> >>>>>
> >> > >> > > > > > > >> >>>>> tison <[hidden email]> 于2019年12月12日周四
> >> > >> 下午4:48写道:
> >> > >> > > > > > > >> >>>>>
> >> > >> > > > > > > >> >>>>>> A quick idea is that we separate the
> deployment
> >> > >> from user
> >> > >> > > > > > program
> >> > >> > > > > > > >> >>> that
> >> > >> > > > > > > >> >>>>> it
> >> > >> > > > > > > >> >>>>>> has always been done
> >> > >> > > > > > > >> >>>>>> outside the program. On user program
> executed there
> >> > >> is
> >> > >> > > > > always a
> >> > >> > > > > > > >> >>>>>> ClusterClient that communicates with
> >> > >> > > > > > > >> >>>>>> an existing cluster, remote or local. It
> will be
> >> > >> another
> >> > >> > > > > thread
> >> > >> > > > > > > so
> >> > >> > > > > > > >> >>> just
> >> > >> > > > > > > >> >>>>> for
> >> > >> > > > > > > >> >>>>>> your information.
> >> > >> > > > > > > >> >>>>>>
> >> > >> > > > > > > >> >>>>>> Best,
> >> > >> > > > > > > >> >>>>>> tison.
> >> > >> > > > > > > >> >>>>>>
> >> > >> > > > > > > >> >>>>>>
> >> > >> > > > > > > >> >>>>>> tison <[hidden email]> 于2019年12月12日周四
> >> > >> 下午4:40写道:
> >> > >> > > > > > > >> >>>>>>
> >> > >> > > > > > > >> >>>>>>> Hi Peter,
> >> > >> > > > > > > >> >>>>>>>
> >> > >> > > > > > > >> >>>>>>> Another concern I realized recently is that
> with
> >> > >> current
> >> > >> > > > > > > Executors
> >> > >> > > > > > > >> >>>>>>> abstraction(FLIP-73)
> >> > >> > > > > > > >> >>>>>>> I'm afraid that user program is designed to
> ALWAYS
> >> > >> run
> >> > >> > > on
> >> > >> > > > > the
> >> > >> > > > > > > >> >>> client
> >> > >> > > > > > > >> >>>>>> side.
> >> > >> > > > > > > >> >>>>>>> Specifically,
> >> > >> > > > > > > >> >>>>>>> we deploy the job in executor when
> env.execute
> >> > >> called.
> >> > >> > > > This
> >> > >> > > > > > > >> >>>>> abstraction
> >> > >> > > > > > > >> >>>>>>> possibly prevents
> >> > >> > > > > > > >> >>>>>>> Flink runs user program on the cluster side.
> >> > >> > > > > > > >> >>>>>>>
> >> > >> > > > > > > >> >>>>>>> For your proposal, in this case we already
> >> > >> compiled the
> >> > >> > > > > > program
> >> > >> > > > > > > >> and
> >> > >> > > > > > > >> >>>>> run
> >> > >> > > > > > > >> >>>>>> on
> >> > >> > > > > > > >> >>>>>>> the client side,
> >> > >> > > > > > > >> >>>>>>> even we deploy a cluster and retrieve job
> graph
> >> > >> from
> >> > >> > > > program
> >> > >> > > > > > > >> >>>>> metadata, it
> >> > >> > > > > > > >> >>>>>>> doesn't make
> >> > >> > > > > > > >> >>>>>>> many sense.
> >> > >> > > > > > > >> >>>>>>>
> >> > >> > > > > > > >> >>>>>>> cc Aljoscha & Kostas what do you think
> about this
> >> > >> > > > > constraint?
> >> > >> > > > > > > >> >>>>>>>
> >> > >> > > > > > > >> >>>>>>> Best,
> >> > >> > > > > > > >> >>>>>>> tison.
> >> > >> > > > > > > >> >>>>>>>
> >> > >> > > > > > > >> >>>>>>>
> >> > >> > > > > > > >> >>>>>>> Peter Huang <[hidden email]>
> >> > >> 于2019年12月10日周二
> >> > >> > > > > > > >> 下午12:45写道:
> >> > >> > > > > > > >> >>>>>>>
> >> > >> > > > > > > >> >>>>>>>> Hi Tison,
> >> > >> > > > > > > >> >>>>>>>>
> >> > >> > > > > > > >> >>>>>>>> Yes, you are right. I think I made the
> wrong
> >> > >> argument
> >> > >> > > in
> >> > >> > > > > the
> >> > >> > > > > > > doc.
> >> > >> > > > > > > >> >>>>>>>> Basically, the packaging jar problem is
> only for
> >> > >> > > platform
> >> > >> > > > > > > users.
> >> > >> > > > > > > >> >>> In
> >> > >> > > > > > > >> >>>>> our
> >> > >> > > > > > > >> >>>>>>>> internal deploy service,
> >> > >> > > > > > > >> >>>>>>>> we further optimized the deployment
> latency by
> >> > >> letting
> >> > >> > > > > users
> >> > >> > > > > > to
> >> > >> > > > > > > >> >>>>>> packaging
> >> > >> > > > > > > >> >>>>>>>> flink-runtime together with the uber jar,
> so that
> >> > >> we
> >> > >> > > > don't
> >> > >> > > > > > need
> >> > >> > > > > > > >> to
> >> > >> > > > > > > >> >>>>>>>> consider
> >> > >> > > > > > > >> >>>>>>>> multiple flink version
> >> > >> > > > > > > >> >>>>>>>> support for now. In the session client
> mode, as
> >> > >> Flink
> >> > >> > > > libs
> >> > >> > > > > > will
> >> > >> > > > > > > >> be
> >> > >> > > > > > > >> >>>>>> shipped
> >> > >> > > > > > > >> >>>>>>>> anyway as local resources of yarn. Users
> actually
> >> > >> don't
> >> > >> > > > > need
> >> > >> > > > > > to
> >> > >> > > > > > > >> >>>>> package
> >> > >> > > > > > > >> >>>>>>>> those libs into job jar.
> >> > >> > > > > > > >> >>>>>>>>
> >> > >> > > > > > > >> >>>>>>>>
> >> > >> > > > > > > >> >>>>>>>>
> >> > >> > > > > > > >> >>>>>>>> Best Regards
> >> > >> > > > > > > >> >>>>>>>> Peter Huang
> >> > >> > > > > > > >> >>>>>>>>
> >> > >> > > > > > > >> >>>>>>>> On Mon, Dec 9, 2019 at 8:35 PM tison <
> >> > >> > > > [hidden email]
> >> > >> > > > > >
> >> > >> > > > > > > >> >>> wrote:
> >> > >> > > > > > > >> >>>>>>>>
> >> > >> > > > > > > >> >>>>>>>>>> 3. What do you mean about the package?
> Do users
> >> > >> need
> >> > >> > > to
> >> > >> > > > > > > >> >>> compile
> >> > >> > > > > > > >> >>>>>> their
> >> > >> > > > > > > >> >>>>>>>>> jars
> >> > >> > > > > > > >> >>>>>>>>> inlcuding flink-clients, flink-optimizer,
> >> > >> flink-table
> >> > >> > > > > codes?
> >> > >> > > > > > > >> >>>>>>>>>
> >> > >> > > > > > > >> >>>>>>>>> The answer should be no because they
> exist in
> >> > >> system
> >> > >> > > > > > > classpath.
> >> > >> > > > > > > >> >>>>>>>>>
> >> > >> > > > > > > >> >>>>>>>>> Best,
> >> > >> > > > > > > >> >>>>>>>>> tison.
> >> > >> > > > > > > >> >>>>>>>>>
> >> > >> > > > > > > >> >>>>>>>>>
> >> > >> > > > > > > >> >>>>>>>>> Yang Wang <[hidden email]>
> 于2019年12月10日周二
> >> > >> > > > > 下午12:18写道:
> >> > >> > > > > > > >> >>>>>>>>>
> >> > >> > > > > > > >> >>>>>>>>>> Hi Peter,
> >> > >> > > > > > > >> >>>>>>>>>>
> >> > >> > > > > > > >> >>>>>>>>>> Thanks a lot for starting this
> discussion. I
> >> > >> think
> >> > >> > > this
> >> > >> > > > > is
> >> > >> > > > > > a
> >> > >> > > > > > > >> >>> very
> >> > >> > > > > > > >> >>>>>>>> useful
> >> > >> > > > > > > >> >>>>>>>>>> feature.
> >> > >> > > > > > > >> >>>>>>>>>>
> >> > >> > > > > > > >> >>>>>>>>>> Not only for Yarn, i am focused on flink
> on
> >> > >> > > Kubernetes
> >> > >> > > > > > > >> >>>>> integration
> >> > >> > > > > > > >> >>>>>> and
> >> > >> > > > > > > >> >>>>>>>>> come
> >> > >> > > > > > > >> >>>>>>>>>> across the same
> >> > >> > > > > > > >> >>>>>>>>>> problem. I do not want the job graph
> generated
> >> > >> on
> >> > >> > > > client
> >> > >> > > > > > > side.
> >> > >> > > > > > > >> >>>>>>>> Instead,
> >> > >> > > > > > > >> >>>>>>>>> the
> >> > >> > > > > > > >> >>>>>>>>>> user jars are built in
> >> > >> > > > > > > >> >>>>>>>>>> a user-defined image. When the job
> manager
> >> > >> launched,
> >> > >> > > we
> >> > >> > > > > > just
> >> > >> > > > > > > >> >>>>> need to
> >> > >> > > > > > > >> >>>>>>>>>> generate the job graph
> >> > >> > > > > > > >> >>>>>>>>>> based on local user jars.
> >> > >> > > > > > > >> >>>>>>>>>>
> >> > >> > > > > > > >> >>>>>>>>>> I have some small suggestion about this.
> >> > >> > > > > > > >> >>>>>>>>>>
> >> > >> > > > > > > >> >>>>>>>>>> 1. `ProgramJobGraphRetriever` is very
> similar to
> >> > >> > > > > > > >> >>>>>>>>>> `ClasspathJobGraphRetriever`, the
> differences
> >> > >> > > > > > > >> >>>>>>>>>> are the former needs `ProgramMetadata`
> and the
> >> > >> latter
> >> > >> > > > > needs
> >> > >> > > > > > > >> >>> some
> >> > >> > > > > > > >> >>>>>>>>> arguments.
> >> > >> > > > > > > >> >>>>>>>>>> Is it possible to
> >> > >> > > > > > > >> >>>>>>>>>> have an unified `JobGraphRetriever` to
> support
> >> > >> both?
> >> > >> > > > > > > >> >>>>>>>>>> 2. Is it possible to not use a local
> user jar to
> >> > >> > > start
> >> > >> > > > a
> >> > >> > > > > > > >> >>> per-job
> >> > >> > > > > > > >> >>>>>>>> cluster?
> >> > >> > > > > > > >> >>>>>>>>>> In your case, the user jars has
> >> > >> > > > > > > >> >>>>>>>>>> existed on hdfs already and we do need to
> >> > >> download
> >> > >> > > the
> >> > >> > > > > jars
> >> > >> > > > > > > to
> >> > >> > > > > > > >> >>>>>>>> deployer
> >> > >> > > > > > > >> >>>>>>>>>> service. Currently, we
> >> > >> > > > > > > >> >>>>>>>>>> always need a local user jar to start a
> flink
> >> > >> > > cluster.
> >> > >> > > > It
> >> > >> > > > > > is
> >> > >> > > > > > > >> >>> be
> >> > >> > > > > > > >> >>>>>> great
> >> > >> > > > > > > >> >>>>>>>> if
> >> > >> > > > > > > >> >>>>>>>>> we
> >> > >> > > > > > > >> >>>>>>>>>> could support remote user jars.
> >> > >> > > > > > > >> >>>>>>>>>>>> In the implementation, we assume users
> package
> >> > >> > > > > > > >> >>> flink-clients,
> >> > >> > > > > > > >> >>>>>>>>>> flink-optimizer, flink-table together
> within
> >> > >> the job
> >> > >> > > > jar.
> >> > >> > > > > > > >> >>>>> Otherwise,
> >> > >> > > > > > > >> >>>>>>>> the
> >> > >> > > > > > > >> >>>>>>>>>> job graph generation within
> >> > >> JobClusterEntryPoint will
> >> > >> > > > > fail.
> >> > >> > > > > > > >> >>>>>>>>>> 3. What do you mean about the package?
> Do users
> >> > >> need
> >> > >> > > to
> >> > >> > > > > > > >> >>> compile
> >> > >> > > > > > > >> >>>>>> their
> >> > >> > > > > > > >> >>>>>>>>> jars
> >> > >> > > > > > > >> >>>>>>>>>> inlcuding flink-clients, flink-optimizer,
> >> > >> flink-table
> >> > >> > > > > > codes?
> >> > >> > > > > > > >> >>>>>>>>>>
> >> > >> > > > > > > >> >>>>>>>>>>
> >> > >> > > > > > > >> >>>>>>>>>>
> >> > >> > > > > > > >> >>>>>>>>>> Best,
> >> > >> > > > > > > >> >>>>>>>>>> Yang
> >> > >> > > > > > > >> >>>>>>>>>>
> >> > >> > > > > > > >> >>>>>>>>>> Peter Huang <[hidden email]>
> >> > >> > > > 于2019年12月10日周二
> >> > >> > > > > > > >> >>>>> 上午2:37写道:
> >> > >> > > > > > > >> >>>>>>>>>>
> >> > >> > > > > > > >> >>>>>>>>>>> Dear All,
> >> > >> > > > > > > >> >>>>>>>>>>>
> >> > >> > > > > > > >> >>>>>>>>>>> Recently, the Flink community starts to
> >> > >> improve the
> >> > >> > > > yarn
> >> > >> > > > > > > >> >>>>> cluster
> >> > >> > > > > > > >> >>>>>>>>>> descriptor
> >> > >> > > > > > > >> >>>>>>>>>>> to make job jar and config files
> configurable
> >> > >> from
> >> > >> > > > CLI.
> >> > >> > > > > It
> >> > >> > > > > > > >> >>>>>> improves
> >> > >> > > > > > > >> >>>>>>>> the
> >> > >> > > > > > > >> >>>>>>>>>>> flexibility of  Flink deployment Yarn
> Per Job
> >> > >> Mode.
> >> > >> > > > For
> >> > >> > > > > > > >> >>>>> platform
> >> > >> > > > > > > >> >>>>>>>> users
> >> > >> > > > > > > >> >>>>>>>>>> who
> >> > >> > > > > > > >> >>>>>>>>>>> manage tens of hundreds of streaming
> pipelines
> >> > >> for
> >> > >> > > the
> >> > >> > > > > > whole
> >> > >> > > > > > > >> >>>>> org
> >> > >> > > > > > > >> >>>>>> or
> >> > >> > > > > > > >> >>>>>>>>>>> company, we found the job graph
> generation in
> >> > >> > > > > client-side
> >> > >> > > > > > is
> >> > >> > > > > > > >> >>>>>> another
> >> > >> > > > > > > >> >>>>>>>>>>> pinpoint. Thus, we want to propose a
> >> > >> configurable
> >> > >> > > > > feature
> >> > >> > > > > > > >> >>> for
> >> > >> > > > > > > >> >>>>>>>>>>> FlinkYarnSessionCli. The feature can
> allow
> >> > >> users to
> >> > >> > > > > choose
> >> > >> > > > > > > >> >>> the
> >> > >> > > > > > > >> >>>>> job
> >> > >> > > > > > > >> >>>>>>>>> graph
> >> > >> > > > > > > >> >>>>>>>>>>> generation in Flink ClusterEntryPoint
> so that
> >> > >> the
> >> > >> > > job
> >> > >> > > > > jar
> >> > >> > > > > > > >> >>>>> doesn't
> >> > >> > > > > > > >> >>>>>>>> need
> >> > >> > > > > > > >> >>>>>>>>> to
> >> > >> > > > > > > >> >>>>>>>>>>> be locally for the job graph
> generation. The
> >> > >> > > proposal
> >> > >> > > > is
> >> > >> > > > > > > >> >>>>> organized
> >> > >> > > > > > > >> >>>>>>>> as a
> >> > >> > > > > > > >> >>>>>>>>>>> FLIP
> >> > >> > > > > > > >> >>>>>>>>>>>
> >> > >> > > > > > > >> >>>>>>>>>>>
> >> > >> > > > > > > >> >>>>>>>>>>
> >> > >> > > > > > > >> >>>>>>>>>
> >> > >> > > > > > > >> >>>>>>>>
> >> > >> > > > > > > >> >>>>>>
> >> > >> > > > > > > >> >>>>>
> >> > >> > > > > > > >> >>>
> >> > >> > > > > > > >>
> >> > >> > > > > > >
> >> > >> > > > > >
> >> > >> > > > >
> >> > >> > > >
> >> > >> > >
> >> > >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-85+Delayed+JobGraph+Generation
> >> > >> > > > > > > >> >>>>>>>>>>> .
> >> > >> > > > > > > >> >>>>>>>>>>>
> >> > >> > > > > > > >> >>>>>>>>>>> Any questions and suggestions are
> welcomed.
> >> > >> Thank
> >> > >> > > you
> >> > >> > > > in
> >> > >> > > > > > > >> >>>>> advance.
> >> > >> > > > > > > >> >>>>>>>>>>>
> >> > >> > > > > > > >> >>>>>>>>>>>
> >> > >> > > > > > > >> >>>>>>>>>>> Best Regards
> >> > >> > > > > > > >> >>>>>>>>>>> Peter Huang
> >> > >> > > > > > > >> >>>>>>>>>>>
> >> > >> > > > > > > >> >>>>>>>>>>
> >> > >> > > > > > > >> >>>>>>>>>
> >> > >> > > > > > > >> >>>>>>>>
> >> > >> > > > > > > >> >>>>>>>
> >> > >> > > > > > > >> >>>>>>
> >> > >> > > > > > > >> >>>>>
> >> > >> > > > > > > >> >>>>
> >> > >> > > > > > > >> >>>
> >> > >> > > > > > > >> >>
> >> > >> > > > > > > >>
> >> > >> > > > > > > >>
> >> > >> > > > > > >
> >> > >> > > > > >
> >> > >> > > > >
> >> > >> > > >
> >> > >> > >
> >> > >>
> >> > >
> >>
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-85: Delayed Job Graph Generation

Kostas Kloudas-4
Hi Peter,

I understand your point. This is why I was also a bit torn about the
name and my proposal was a bit aligned with yours (something along the
lines of "cluster deploy" mode).

But many of the other participants in the discussion suggested the
"Application Mode". I think that the reasoning is that now the user's
Application is more self-contained.
It will be submitted to the cluster and the user can just disconnect.
In addition, as discussed briefly in the doc, in the future there may
be better support for multi-execute applications which will bring us
one step closer to the true "Application Mode". But this is how I
interpreted their arguments, of course they can also express their
thoughts on the topic :)

Cheers,
Kostas

On Mon, Mar 2, 2020 at 6:15 PM Peter Huang <[hidden email]> wrote:

>
> Hi Kostas,
>
> Thanks for updating the wiki. We have aligned with the implementations in the doc. But I feel it is still a little bit confusing of the naming from a user's perspective. It is well known that Flink support per job cluster and session cluster. The concept is in the layer of how a job is managed within Flink. The method introduced util now is a kind of mixing job and session cluster to promising the implementation complexity. We probably don't need to label it as Application Model as the same layer of per job cluster and session cluster. Conceptually, I think it is still a cluster mode implementation for per job cluster.
>
> To minimize the confusion of users, I think it would be better just an option of per job cluster for each type of cluster manager. How do you think?
>
>
> Best Regards
> Peter Huang
>
>
>
>
>
>
>
>
> On Mon, Mar 2, 2020 at 7:22 AM Kostas Kloudas <[hidden email]> wrote:
>>
>> Hi Yang,
>>
>> The difference between per-job and application mode is that, as you
>> described, in the per-job mode the main is executed on the client
>> while in the application mode, the main is executed on the cluster.
>> I do not think we have to offer "application mode" with running the
>> main on the client side as this is exactly what the per-job mode does
>> currently and, as you described also, it would be redundant.
>>
>> Sorry if this was not clear in the document.
>>
>> Cheers,
>> Kostas
>>
>> On Mon, Mar 2, 2020 at 3:17 PM Yang Wang <[hidden email]> wrote:
>> >
>> > Hi Kostas,
>> >
>> > Thanks a lot for your conclusion and updating the FLIP-85 WIKI. Currently, i have no more
>> > questions about motivation, approach, fault tolerance and the first phase implementation.
>> >
>> > I think the new title "Flink Application Mode" makes a lot senses to me. Especially for the
>> > containerized environment, the cluster deploy option will be very useful.
>> >
>> > Just one concern, how do we introduce this new application mode to our users?
>> > Each user program(i.e. `main()`) is an application. Currently, we intend to only support one
>> > `execute()`. So what's the difference between per-job and application mode?
>> >
>> > For per-job, user `main()` is always executed on client side. And For application mode, user
>> > `main()` could be executed on client or master side(configured via cli option).
>> > Right? We need to have a clear concept. Otherwise, the users will be more and more confusing.
>> >
>> >
>> > Best,
>> > Yang
>> >
>> > Kostas Kloudas <[hidden email]> 于2020年3月2日周一 下午5:58写道:
>> >>
>> >> Hi all,
>> >>
>> >> I update https://cwiki.apache.org/confluence/display/FLINK/FLIP-85+Flink+Application+Mode
>> >> based on the discussion we had here:
>> >>
>> >> https://docs.google.com/document/d/1ji72s3FD9DYUyGuKnJoO4ApzV-nSsZa0-bceGXW7Ocw/edit#
>> >>
>> >> Please let me know what you think and please keep the discussion in the ML :)
>> >>
>> >> Thanks for starting the discussion and I hope that soon we will be
>> >> able to vote on the FLIP.
>> >>
>> >> Cheers,
>> >> Kostas
>> >>
>> >> On Thu, Jan 16, 2020 at 3:40 AM Yang Wang <[hidden email]> wrote:
>> >> >
>> >> > Hi all,
>> >> >
>> >> > Thanks a lot for the feedback from @Kostas Kloudas. Your all concerns are
>> >> > on point. The FLIP-85 is mainly
>> >> > focused on supporting cluster mode for per-job. Since it is more urgent and
>> >> > have much more use
>> >> > cases both in Yarn and Kubernetes deployment. For session cluster, we could
>> >> > have more discussion
>> >> > in a new thread later.
>> >> >
>> >> > #1, How to download the user jars and dependencies for per-job in cluster
>> >> > mode?
>> >> > For Yarn, we could register the user jars and dependencies as
>> >> > LocalResource. They will be distributed
>> >> > by Yarn. And once the JobManager and TaskManager launched, the jars are
>> >> > already exists.
>> >> > For Standalone per-job and K8s, we expect that the user jars
>> >> > and dependencies are built into the image.
>> >> > Or the InitContainer could be used for downloading. It is natively
>> >> > distributed and we will not have bottleneck.
>> >> >
>> >> > #2, Job graph recovery
>> >> > We could have an optimization to store job graph on the DFS. However, i
>> >> > suggest building a new jobgraph
>> >> > from the configuration is the default option. Since we will not always have
>> >> > a DFS store when deploying a
>> >> > Flink per-job cluster. Of course, we assume that using the same
>> >> > configuration(e.g. job_id, user_jar, main_class,
>> >> > main_args, parallelism, savepoint_settings, etc.) will get a same job
>> >> > graph. I think the standalone per-job
>> >> > already has the similar behavior.
>> >> >
>> >> > #3, What happens with jobs that have multiple execute calls?
>> >> > Currently, it is really a problem. Even we use a local client on Flink
>> >> > master side, it will have different behavior with
>> >> > client mode. For client mode, if we execute multiple times, then we will
>> >> > deploy multiple Flink clusters for each execute.
>> >> > I am not pretty sure whether it is reasonable. However, i still think using
>> >> > the local client is a good choice. We could
>> >> > continue the discussion in a new thread. @Zili Chen <[hidden email]> Do
>> >> > you want to drive this?
>> >> >
>> >> >
>> >> >
>> >> > Best,
>> >> > Yang
>> >> >
>> >> > Peter Huang <[hidden email]> 于2020年1月16日周四 上午1:55写道:
>> >> >
>> >> > > Hi Kostas,
>> >> > >
>> >> > > Thanks for this feedback. I can't agree more about the opinion. The
>> >> > > cluster mode should be added
>> >> > > first in per job cluster.
>> >> > >
>> >> > > 1) For job cluster implementation
>> >> > > 1. Job graph recovery from configuration or store as static job graph as
>> >> > > session cluster. I think the static one will be better for less recovery
>> >> > > time.
>> >> > > Let me update the doc for details.
>> >> > >
>> >> > > 2. For job execute multiple times, I think @Zili Chen
>> >> > > <[hidden email]> has proposed the local client solution that can
>> >> > > the run program actually in the cluster entry point. We can put the
>> >> > > implementation in the second stage,
>> >> > > or even a new FLIP for further discussion.
>> >> > >
>> >> > > 2) For session cluster implementation
>> >> > > We can disable the cluster mode for the session cluster in the first
>> >> > > stage. I agree the jar downloading will be a painful thing.
>> >> > > We can consider about PoC and performance evaluation first. If the end to
>> >> > > end experience is good enough, then we can consider
>> >> > > proceeding with the solution.
>> >> > >
>> >> > > Looking forward to more opinions from @Yang Wang <[hidden email]> @Zili
>> >> > > Chen <[hidden email]> @Dian Fu <[hidden email]>.
>> >> > >
>> >> > >
>> >> > > Best Regards
>> >> > > Peter Huang
>> >> > >
>> >> > > On Wed, Jan 15, 2020 at 7:50 AM Kostas Kloudas <[hidden email]> wrote:
>> >> > >
>> >> > >> Hi all,
>> >> > >>
>> >> > >> I am writing here as the discussion on the Google Doc seems to be a
>> >> > >> bit difficult to follow.
>> >> > >>
>> >> > >> I think that in order to be able to make progress, it would be helpful
>> >> > >> to focus on per-job mode for now.
>> >> > >> The reason is that:
>> >> > >>  1) making the (unique) JobSubmitHandler responsible for creating the
>> >> > >> jobgraphs,
>> >> > >>   which includes downloading dependencies, is not an optimal solution
>> >> > >>  2) even if we put the responsibility on the JobMaster, currently each
>> >> > >> job has its own
>> >> > >>   JobMaster but they all run on the same process, so we have again a
>> >> > >> single entity.
>> >> > >>
>> >> > >> Of course after this is done, and if we feel comfortable with the
>> >> > >> solution, then we can go to the session mode.
>> >> > >>
>> >> > >> A second comment has to do with fault-tolerance in the per-job,
>> >> > >> cluster-deploy mode.
>> >> > >> In the document, it is suggested that upon recovery, the JobMaster of
>> >> > >> each job re-creates the JobGraph.
>> >> > >> I am just wondering if it is better to create and store the jobGraph
>> >> > >> upon submission and only fetch it
>> >> > >> upon recovery so that we have a static jobGraph.
>> >> > >>
>> >> > >> Finally, I have a question which is what happens with jobs that have
>> >> > >> multiple execute calls?
>> >> > >> The semantics seem to change compared to the current behaviour, right?
>> >> > >>
>> >> > >> Cheers,
>> >> > >> Kostas
>> >> > >>
>> >> > >> On Wed, Jan 8, 2020 at 8:05 PM tison <[hidden email]> wrote:
>> >> > >> >
>> >> > >> > not always, Yang Wang is also not yet a committer but he can join the
>> >> > >> > channel. I cannot find the id by clicking “Add new member in channel” so
>> >> > >> > come to you and ask for try out the link. Possibly I will find other
>> >> > >> ways
>> >> > >> > but the original purpose is that the slack channel is a public area we
>> >> > >> > discuss about developing...
>> >> > >> > Best,
>> >> > >> > tison.
>> >> > >> >
>> >> > >> >
>> >> > >> > Peter Huang <[hidden email]> 于2020年1月9日周四 上午2:44写道:
>> >> > >> >
>> >> > >> > > Hi Tison,
>> >> > >> > >
>> >> > >> > > I am not the committer of Flink yet. I think I can't join it also.
>> >> > >> > >
>> >> > >> > >
>> >> > >> > > Best Regards
>> >> > >> > > Peter Huang
>> >> > >> > >
>> >> > >> > > On Wed, Jan 8, 2020 at 9:39 AM tison <[hidden email]> wrote:
>> >> > >> > >
>> >> > >> > > > Hi Peter,
>> >> > >> > > >
>> >> > >> > > > Could you try out this link?
>> >> > >> > > https://the-asf.slack.com/messages/CNA3ADZPH
>> >> > >> > > >
>> >> > >> > > > Best,
>> >> > >> > > > tison.
>> >> > >> > > >
>> >> > >> > > >
>> >> > >> > > > Peter Huang <[hidden email]> 于2020年1月9日周四 上午1:22写道:
>> >> > >> > > >
>> >> > >> > > > > Hi Tison,
>> >> > >> > > > >
>> >> > >> > > > > I can't join the group with shared link. Would you please add me
>> >> > >> into
>> >> > >> > > the
>> >> > >> > > > > group? My slack account is huangzhenqiu0825.
>> >> > >> > > > > Thank you in advance.
>> >> > >> > > > >
>> >> > >> > > > >
>> >> > >> > > > > Best Regards
>> >> > >> > > > > Peter Huang
>> >> > >> > > > >
>> >> > >> > > > > On Wed, Jan 8, 2020 at 12:02 AM tison <[hidden email]>
>> >> > >> wrote:
>> >> > >> > > > >
>> >> > >> > > > > > Hi Peter,
>> >> > >> > > > > >
>> >> > >> > > > > > As described above, this effort should get attention from people
>> >> > >> > > > > developing
>> >> > >> > > > > > FLIP-73 a.k.a. Executor abstractions. I recommend you to join
>> >> > >> the
>> >> > >> > > > public
>> >> > >> > > > > > slack channel[1] for Flink Client API Enhancement and you can
>> >> > >> try to
>> >> > >> > > > > share
>> >> > >> > > > > > you detailed thoughts there. It possibly gets more concrete
>> >> > >> > > attentions.
>> >> > >> > > > > >
>> >> > >> > > > > > Best,
>> >> > >> > > > > > tison.
>> >> > >> > > > > >
>> >> > >> > > > > > [1]
>> >> > >> > > > > >
>> >> > >> > > > > >
>> >> > >> > > > >
>> >> > >> > > >
>> >> > >> > >
>> >> > >> https://slack.com/share/IS21SJ75H/Rk8HhUly9FuEHb7oGwBZ33uL/enQtODg2MDYwNjE5MTg3LTA2MjIzNDc1M2ZjZDVlMjdlZjk1M2RkYmJhNjAwMTk2ZDZkODQ4NmY5YmI4OGRhNWJkYTViMTM1NzlmMzc4OWM
>> >> > >> > > > > >
>> >> > >> > > > > >
>> >> > >> > > > > > Peter Huang <[hidden email]> 于2020年1月7日周二 上午5:09写道:
>> >> > >> > > > > >
>> >> > >> > > > > > > Dear All,
>> >> > >> > > > > > >
>> >> > >> > > > > > > Happy new year! According to existing feedback from the
>> >> > >> community,
>> >> > >> > > we
>> >> > >> > > > > > > revised the doc with the consideration of session cluster
>> >> > >> support,
>> >> > >> > > > and
>> >> > >> > > > > > > concrete interface changes needed and execution plan. Please
>> >> > >> take
>> >> > >> > > one
>> >> > >> > > > > > more
>> >> > >> > > > > > > round of review at your most convenient time.
>> >> > >> > > > > > >
>> >> > >> > > > > > >
>> >> > >> > > > > > >
>> >> > >> > > > > >
>> >> > >> > > > >
>> >> > >> > > >
>> >> > >> > >
>> >> > >> https://docs.google.com/document/d/1aAwVjdZByA-0CHbgv16Me-vjaaDMCfhX7TzVVTuifYM/edit#
>> >> > >> > > > > > >
>> >> > >> > > > > > >
>> >> > >> > > > > > > Best Regards
>> >> > >> > > > > > > Peter Huang
>> >> > >> > > > > > >
>> >> > >> > > > > > >
>> >> > >> > > > > > >
>> >> > >> > > > > > >
>> >> > >> > > > > > >
>> >> > >> > > > > > > On Thu, Jan 2, 2020 at 11:29 AM Peter Huang <
>> >> > >> > > > > [hidden email]>
>> >> > >> > > > > > > wrote:
>> >> > >> > > > > > >
>> >> > >> > > > > > > > Hi Dian,
>> >> > >> > > > > > > > Thanks for giving us valuable feedbacks.
>> >> > >> > > > > > > >
>> >> > >> > > > > > > > 1) It's better to have a whole design for this feature
>> >> > >> > > > > > > > For the suggestion of enabling the cluster mode also session
>> >> > >> > > > > cluster, I
>> >> > >> > > > > > > > think Flink already supported it. WebSubmissionExtension
>> >> > >> already
>> >> > >> > > > > allows
>> >> > >> > > > > > > > users to start a job with the specified jar by using web UI.
>> >> > >> > > > > > > > But we need to enable the feature from CLI for both local
>> >> > >> jar,
>> >> > >> > > > remote
>> >> > >> > > > > > > jar.
>> >> > >> > > > > > > > I will align with Yang Wang first about the details and
>> >> > >> update
>> >> > >> > > the
>> >> > >> > > > > > design
>> >> > >> > > > > > > > doc.
>> >> > >> > > > > > > >
>> >> > >> > > > > > > > 2) It's better to consider the convenience for users, such
>> >> > >> as
>> >> > >> > > > > debugging
>> >> > >> > > > > > > >
>> >> > >> > > > > > > > I am wondering whether we can store the exception in
>> >> > >> jobgragh
>> >> > >> > > > > > > > generation in application master. As no streaming graph can
>> >> > >> be
>> >> > >> > > > > > scheduled
>> >> > >> > > > > > > in
>> >> > >> > > > > > > > this case, there will be no more TM will be requested from
>> >> > >> > > FlinkRM.
>> >> > >> > > > > > > > If the AM is still running, users can still query it from
>> >> > >> CLI. As
>> >> > >> > > > it
>> >> > >> > > > > > > > requires more change, we can get some feedback from <
>> >> > >> > > > > > [hidden email]
>> >> > >> > > > > > > >
>> >> > >> > > > > > > > and @[hidden email] <[hidden email]>.
>> >> > >> > > > > > > >
>> >> > >> > > > > > > > 3) It's better to consider the impact to the stability of
>> >> > >> the
>> >> > >> > > > cluster
>> >> > >> > > > > > > >
>> >> > >> > > > > > > > I agree with Yang Wang's opinion.
>> >> > >> > > > > > > >
>> >> > >> > > > > > > >
>> >> > >> > > > > > > >
>> >> > >> > > > > > > > Best Regards
>> >> > >> > > > > > > > Peter Huang
>> >> > >> > > > > > > >
>> >> > >> > > > > > > >
>> >> > >> > > > > > > > On Sun, Dec 29, 2019 at 9:44 PM Dian Fu <
>> >> > >> [hidden email]>
>> >> > >> > > > > wrote:
>> >> > >> > > > > > > >
>> >> > >> > > > > > > >> Hi all,
>> >> > >> > > > > > > >>
>> >> > >> > > > > > > >> Sorry to jump into this discussion. Thanks everyone for the
>> >> > >> > > > > > discussion.
>> >> > >> > > > > > > >> I'm very interested in this topic although I'm not an
>> >> > >> expert in
>> >> > >> > > > this
>> >> > >> > > > > > > part.
>> >> > >> > > > > > > >> So I'm glad to share my thoughts as following:
>> >> > >> > > > > > > >>
>> >> > >> > > > > > > >> 1) It's better to have a whole design for this feature
>> >> > >> > > > > > > >> As we know, there are two deployment modes: per-job mode
>> >> > >> and
>> >> > >> > > > session
>> >> > >> > > > > > > >> mode. I'm wondering which mode really needs this feature.
>> >> > >> As the
>> >> > >> > > > > > design
>> >> > >> > > > > > > doc
>> >> > >> > > > > > > >> mentioned, per-job mode is more used for streaming jobs and
>> >> > >> > > > session
>> >> > >> > > > > > > mode is
>> >> > >> > > > > > > >> usually used for batch jobs(Of course, the job types and
>> >> > >> the
>> >> > >> > > > > > deployment
>> >> > >> > > > > > > >> modes are orthogonal). Usually streaming job is only
>> >> > >> needed to
>> >> > >> > > be
>> >> > >> > > > > > > submitted
>> >> > >> > > > > > > >> once and it will run for days or weeks, while batch jobs
>> >> > >> will be
>> >> > >> > > > > > > submitted
>> >> > >> > > > > > > >> more frequently compared with streaming jobs. This means
>> >> > >> that
>> >> > >> > > > maybe
>> >> > >> > > > > > > session
>> >> > >> > > > > > > >> mode also needs this feature. However, if we support this
>> >> > >> > > feature
>> >> > >> > > > in
>> >> > >> > > > > > > >> session mode, the application master will become the new
>> >> > >> > > > centralized
>> >> > >> > > > > > > >> service(which should be solved). So in this case, it's
>> >> > >> better to
>> >> > >> > > > > have
>> >> > >> > > > > > a
>> >> > >> > > > > > > >> complete design for both per-job mode and session mode.
>> >> > >> > > > Furthermore,
>> >> > >> > > > > > > even
>> >> > >> > > > > > > >> if we can do it phase by phase, we need to have a whole
>> >> > >> picture
>> >> > >> > > of
>> >> > >> > > > > how
>> >> > >> > > > > > > it
>> >> > >> > > > > > > >> works in both per-job mode and session mode.
>> >> > >> > > > > > > >>
>> >> > >> > > > > > > >> 2) It's better to consider the convenience for users, such
>> >> > >> as
>> >> > >> > > > > > debugging
>> >> > >> > > > > > > >> After we finish this feature, the job graph will be
>> >> > >> compiled in
>> >> > >> > > > the
>> >> > >> > > > > > > >> application master, which means that users cannot easily
>> >> > >> get the
>> >> > >> > > > > > > exception
>> >> > >> > > > > > > >> message synchorousely in the job client if there are
>> >> > >> problems
>> >> > >> > > > during
>> >> > >> > > > > > the
>> >> > >> > > > > > > >> job graph compiling (especially for platform users), such
>> >> > >> as the
>> >> > >> > > > > > > resource
>> >> > >> > > > > > > >> path is incorrect, the user program itself has some
>> >> > >> problems,
>> >> > >> > > etc.
>> >> > >> > > > > > What
>> >> > >> > > > > > > I'm
>> >> > >> > > > > > > >> thinking is that maybe we should throw the exceptions as
>> >> > >> early
>> >> > >> > > as
>> >> > >> > > > > > > possible
>> >> > >> > > > > > > >> (during job submission stage).
>> >> > >> > > > > > > >>
>> >> > >> > > > > > > >> 3) It's better to consider the impact to the stability of
>> >> > >> the
>> >> > >> > > > > cluster
>> >> > >> > > > > > > >> If we perform the compiling in the application master, we
>> >> > >> should
>> >> > >> > > > > > > consider
>> >> > >> > > > > > > >> the impact of the compiling errors. Although YARN could
>> >> > >> resume
>> >> > >> > > the
>> >> > >> > > > > > > >> application master in case of failures, but in some case
>> >> > >> the
>> >> > >> > > > > compiling
>> >> > >> > > > > > > >> failure may be a waste of cluster resource and may impact
>> >> > >> the
>> >> > >> > > > > > stability
>> >> > >> > > > > > > the
>> >> > >> > > > > > > >> cluster and the other jobs in the cluster, such as the
>> >> > >> resource
>> >> > >> > > > path
>> >> > >> > > > > > is
>> >> > >> > > > > > > >> incorrect, the user program itself has some problems(in
>> >> > >> this
>> >> > >> > > case,
>> >> > >> > > > > job
>> >> > >> > > > > > > >> failover cannot solve this kind of problems) etc. In the
>> >> > >> current
>> >> > >> > > > > > > >> implemention, the compiling errors are handled in the
>> >> > >> client
>> >> > >> > > side
>> >> > >> > > > > and
>> >> > >> > > > > > > there
>> >> > >> > > > > > > >> is no impact to the cluster at all.
>> >> > >> > > > > > > >>
>> >> > >> > > > > > > >> Regarding to 1), it's clearly pointed in the design doc
>> >> > >> that
>> >> > >> > > only
>> >> > >> > > > > > > per-job
>> >> > >> > > > > > > >> mode will be supported. However, I think it's better to
>> >> > >> also
>> >> > >> > > > > consider
>> >> > >> > > > > > > the
>> >> > >> > > > > > > >> session mode in the design doc.
>> >> > >> > > > > > > >> Regarding to 2) and 3), I have not seen related sections
>> >> > >> in the
>> >> > >> > > > > design
>> >> > >> > > > > > > >> doc. It will be good if we can cover them in the design
>> >> > >> doc.
>> >> > >> > > > > > > >>
>> >> > >> > > > > > > >> Feel free to correct me If there is anything I
>> >> > >> misunderstand.
>> >> > >> > > > > > > >>
>> >> > >> > > > > > > >> Regards,
>> >> > >> > > > > > > >> Dian
>> >> > >> > > > > > > >>
>> >> > >> > > > > > > >>
>> >> > >> > > > > > > >> > 在 2019年12月27日,上午3:13,Peter Huang <
>> >> > >> [hidden email]>
>> >> > >> > > > 写道:
>> >> > >> > > > > > > >> >
>> >> > >> > > > > > > >> > Hi Yang,
>> >> > >> > > > > > > >> >
>> >> > >> > > > > > > >> > I can't agree more. The effort definitely needs to align
>> >> > >> with
>> >> > >> > > > the
>> >> > >> > > > > > > final
>> >> > >> > > > > > > >> > goal of FLIP-73.
>> >> > >> > > > > > > >> > I am thinking about whether we can achieve the goal with
>> >> > >> two
>> >> > >> > > > > phases.
>> >> > >> > > > > > > >> >
>> >> > >> > > > > > > >> > 1) Phase I
>> >> > >> > > > > > > >> > As the CLiFrontend will not be depreciated soon. We can
>> >> > >> still
>> >> > >> > > > use
>> >> > >> > > > > > the
>> >> > >> > > > > > > >> > deployMode flag there,
>> >> > >> > > > > > > >> > pass the program info through Flink configuration,  use
>> >> > >> the
>> >> > >> > > > > > > >> > ClassPathJobGraphRetriever
>> >> > >> > > > > > > >> > to generate the job graph in ClusterEntrypoints of yarn
>> >> > >> and
>> >> > >> > > > > > > Kubernetes.
>> >> > >> > > > > > > >> >
>> >> > >> > > > > > > >> > 2) Phase II
>> >> > >> > > > > > > >> > In  AbstractJobClusterExecutor, the job graph is
>> >> > >> generated in
>> >> > >> > > > the
>> >> > >> > > > > > > >> execute
>> >> > >> > > > > > > >> > function. We can still
>> >> > >> > > > > > > >> > use the deployMode in it. With deployMode = cluster, the
>> >> > >> > > execute
>> >> > >> > > > > > > >> function
>> >> > >> > > > > > > >> > only starts the cluster.
>> >> > >> > > > > > > >> >
>> >> > >> > > > > > > >> > When {Yarn/Kuberneates}PerJobClusterEntrypoint starts,
>> >> > >> It will
>> >> > >> > > > > start
>> >> > >> > > > > > > the
>> >> > >> > > > > > > >> > dispatch first, then we can use
>> >> > >> > > > > > > >> > a ClusterEnvironment similar to ContextEnvironment to
>> >> > >> submit
>> >> > >> > > the
>> >> > >> > > > > job
>> >> > >> > > > > > > >> with
>> >> > >> > > > > > > >> > jobName the local
>> >> > >> > > > > > > >> > dispatcher. For the details, we need more investigation.
>> >> > >> Let's
>> >> > >> > > > > wait
>> >> > >> > > > > > > >> > for @Aljoscha
>> >> > >> > > > > > > >> > Krettek <[hidden email]> @Till Rohrmann <
>> >> > >> > > > > [hidden email]
>> >> > >> > > > > > >'s
>> >> > >> > > > > > > >> > feedback after the holiday season.
>> >> > >> > > > > > > >> >
>> >> > >> > > > > > > >> > Thank you in advance. Merry Chrismas and Happy New
>> >> > >> Year!!!
>> >> > >> > > > > > > >> >
>> >> > >> > > > > > > >> >
>> >> > >> > > > > > > >> > Best Regards
>> >> > >> > > > > > > >> > Peter Huang
>> >> > >> > > > > > > >> >
>> >> > >> > > > > > > >> >
>> >> > >> > > > > > > >> >
>> >> > >> > > > > > > >> >
>> >> > >> > > > > > > >> >
>> >> > >> > > > > > > >> >
>> >> > >> > > > > > > >> >
>> >> > >> > > > > > > >> >
>> >> > >> > > > > > > >> > On Wed, Dec 25, 2019 at 1:08 AM Yang Wang <
>> >> > >> > > > [hidden email]>
>> >> > >> > > > > > > >> wrote:
>> >> > >> > > > > > > >> >
>> >> > >> > > > > > > >> >> Hi Peter,
>> >> > >> > > > > > > >> >>
>> >> > >> > > > > > > >> >> I think we need to reconsider tison's suggestion
>> >> > >> seriously.
>> >> > >> > > > After
>> >> > >> > > > > > > >> FLIP-73,
>> >> > >> > > > > > > >> >> the deployJobCluster has
>> >> > >> > > > > > > >> >> beenmoved into `JobClusterExecutor#execute`. It should
>> >> > >> not be
>> >> > >> > > > > > > perceived
>> >> > >> > > > > > > >> >> for `CliFrontend`. That
>> >> > >> > > > > > > >> >> means the user program will *ALWAYS* be executed on
>> >> > >> client
>> >> > >> > > > side.
>> >> > >> > > > > > This
>> >> > >> > > > > > > >> is
>> >> > >> > > > > > > >> >> the by design behavior.
>> >> > >> > > > > > > >> >> So, we could not just add `if(client mode) .. else
>> >> > >> if(cluster
>> >> > >> > > > > mode)
>> >> > >> > > > > > > >> ...`
>> >> > >> > > > > > > >> >> codes in `CliFrontend` to bypass
>> >> > >> > > > > > > >> >> the executor. We need to find a clean way to decouple
>> >> > >> > > executing
>> >> > >> > > > > > user
>> >> > >> > > > > > > >> >> program and deploying per-job
>> >> > >> > > > > > > >> >> cluster. Based on this, we could support to execute user
>> >> > >> > > > program
>> >> > >> > > > > on
>> >> > >> > > > > > > >> client
>> >> > >> > > > > > > >> >> or master side.
>> >> > >> > > > > > > >> >>
>> >> > >> > > > > > > >> >> Maybe Aljoscha and Jeff could give some good
>> >> > >> suggestions.
>> >> > >> > > > > > > >> >>
>> >> > >> > > > > > > >> >>
>> >> > >> > > > > > > >> >>
>> >> > >> > > > > > > >> >> Best,
>> >> > >> > > > > > > >> >> Yang
>> >> > >> > > > > > > >> >>
>> >> > >> > > > > > > >> >> Peter Huang <[hidden email]> 于2019年12月25日周三
>> >> > >> > > > > 上午4:03写道:
>> >> > >> > > > > > > >> >>
>> >> > >> > > > > > > >> >>> Hi Jingjing,
>> >> > >> > > > > > > >> >>>
>> >> > >> > > > > > > >> >>> The improvement proposed is a deployment option for
>> >> > >> CLI. For
>> >> > >> > > > SQL
>> >> > >> > > > > > > based
>> >> > >> > > > > > > >> >>> Flink application, It is more convenient to use the
>> >> > >> existing
>> >> > >> > > > > model
>> >> > >> > > > > > > in
>> >> > >> > > > > > > >> >>> SqlClient in which
>> >> > >> > > > > > > >> >>> the job graph is generated within SqlClient. After
>> >> > >> adding
>> >> > >> > > the
>> >> > >> > > > > > > delayed
>> >> > >> > > > > > > >> job
>> >> > >> > > > > > > >> >>> graph generation, I think there is no change is needed
>> >> > >> for
>> >> > >> > > > your
>> >> > >> > > > > > > side.
>> >> > >> > > > > > > >> >>>
>> >> > >> > > > > > > >> >>>
>> >> > >> > > > > > > >> >>> Best Regards
>> >> > >> > > > > > > >> >>> Peter Huang
>> >> > >> > > > > > > >> >>>
>> >> > >> > > > > > > >> >>>
>> >> > >> > > > > > > >> >>> On Wed, Dec 18, 2019 at 6:01 AM jingjing bai <
>> >> > >> > > > > > > >> [hidden email]>
>> >> > >> > > > > > > >> >>> wrote:
>> >> > >> > > > > > > >> >>>
>> >> > >> > > > > > > >> >>>> hi peter:
>> >> > >> > > > > > > >> >>>>    we had extension SqlClent to support sql job
>> >> > >> submit in
>> >> > >> > > web
>> >> > >> > > > > > base
>> >> > >> > > > > > > on
>> >> > >> > > > > > > >> >>>> flink 1.9.   we support submit to yarn on per job
>> >> > >> mode too.
>> >> > >> > > > > > > >> >>>>    in this case, the job graph generated  on client
>> >> > >> side
>> >> > >> > > .  I
>> >> > >> > > > > > think
>> >> > >> > > > > > > >> >>> this
>> >> > >> > > > > > > >> >>>> discuss Mainly to improve api programme.  but in my
>> >> > >> case ,
>> >> > >> > > > > there
>> >> > >> > > > > > is
>> >> > >> > > > > > > >> no
>> >> > >> > > > > > > >> >>>> jar to upload but only a sql string .
>> >> > >> > > > > > > >> >>>>    do u had more suggestion to improve for sql mode
>> >> > >> or it
>> >> > >> > > is
>> >> > >> > > > > > only a
>> >> > >> > > > > > > >> >>>> switch for api programme?
>> >> > >> > > > > > > >> >>>>
>> >> > >> > > > > > > >> >>>>
>> >> > >> > > > > > > >> >>>> best
>> >> > >> > > > > > > >> >>>> bai jj
>> >> > >> > > > > > > >> >>>>
>> >> > >> > > > > > > >> >>>>
>> >> > >> > > > > > > >> >>>> Yang Wang <[hidden email]> 于2019年12月18日周三
>> >> > >> 下午7:21写道:
>> >> > >> > > > > > > >> >>>>
>> >> > >> > > > > > > >> >>>>> I just want to revive this discussion.
>> >> > >> > > > > > > >> >>>>>
>> >> > >> > > > > > > >> >>>>> Recently, i am thinking about how to natively run
>> >> > >> flink
>> >> > >> > > > > per-job
>> >> > >> > > > > > > >> >>> cluster on
>> >> > >> > > > > > > >> >>>>> Kubernetes.
>> >> > >> > > > > > > >> >>>>> The per-job mode on Kubernetes is very different
>> >> > >> from on
>> >> > >> > > > Yarn.
>> >> > >> > > > > > And
>> >> > >> > > > > > > >> we
>> >> > >> > > > > > > >> >>> will
>> >> > >> > > > > > > >> >>>>> have
>> >> > >> > > > > > > >> >>>>> the same deployment requirements to the client and
>> >> > >> entry
>> >> > >> > > > > point.
>> >> > >> > > > > > > >> >>>>>
>> >> > >> > > > > > > >> >>>>> 1. Flink client not always need a local jar to start
>> >> > >> a
>> >> > >> > > Flink
>> >> > >> > > > > > > per-job
>> >> > >> > > > > > > >> >>>>> cluster. We could
>> >> > >> > > > > > > >> >>>>> support multiple schemas. For example,
>> >> > >> > > > file:///path/of/my.jar
>> >> > >> > > > > > > means
>> >> > >> > > > > > > >> a
>> >> > >> > > > > > > >> >>> jar
>> >> > >> > > > > > > >> >>>>> located
>> >> > >> > > > > > > >> >>>>> at client side,
>> >> > >> hdfs://myhdfs/user/myname/flink/my.jar
>> >> > >> > > > means a
>> >> > >> > > > > > jar
>> >> > >> > > > > > > >> >>> located
>> >> > >> > > > > > > >> >>>>> at
>> >> > >> > > > > > > >> >>>>> remote hdfs, local:///path/in/image/my.jar means a
>> >> > >> jar
>> >> > >> > > > located
>> >> > >> > > > > > at
>> >> > >> > > > > > > >> >>>>> jobmanager side.
>> >> > >> > > > > > > >> >>>>>
>> >> > >> > > > > > > >> >>>>> 2. Support running user program on master side. This
>> >> > >> also
>> >> > >> > > > > means
>> >> > >> > > > > > > the
>> >> > >> > > > > > > >> >>> entry
>> >> > >> > > > > > > >> >>>>> point
>> >> > >> > > > > > > >> >>>>> will generate the job graph on master side. We could
>> >> > >> use
>> >> > >> > > the
>> >> > >> > > > > > > >> >>>>> ClasspathJobGraphRetriever
>> >> > >> > > > > > > >> >>>>> or start a local Flink client to achieve this
>> >> > >> purpose.
>> >> > >> > > > > > > >> >>>>>
>> >> > >> > > > > > > >> >>>>>
>> >> > >> > > > > > > >> >>>>> cc tison, Aljoscha & Kostas Do you think this is the
>> >> > >> right
>> >> > >> > > > > > > >> direction we
>> >> > >> > > > > > > >> >>>>> need to work?
>> >> > >> > > > > > > >> >>>>>
>> >> > >> > > > > > > >> >>>>> tison <[hidden email]> 于2019年12月12日周四
>> >> > >> 下午4:48写道:
>> >> > >> > > > > > > >> >>>>>
>> >> > >> > > > > > > >> >>>>>> A quick idea is that we separate the deployment
>> >> > >> from user
>> >> > >> > > > > > program
>> >> > >> > > > > > > >> >>> that
>> >> > >> > > > > > > >> >>>>> it
>> >> > >> > > > > > > >> >>>>>> has always been done
>> >> > >> > > > > > > >> >>>>>> outside the program. On user program executed there
>> >> > >> is
>> >> > >> > > > > always a
>> >> > >> > > > > > > >> >>>>>> ClusterClient that communicates with
>> >> > >> > > > > > > >> >>>>>> an existing cluster, remote or local. It will be
>> >> > >> another
>> >> > >> > > > > thread
>> >> > >> > > > > > > so
>> >> > >> > > > > > > >> >>> just
>> >> > >> > > > > > > >> >>>>> for
>> >> > >> > > > > > > >> >>>>>> your information.
>> >> > >> > > > > > > >> >>>>>>
>> >> > >> > > > > > > >> >>>>>> Best,
>> >> > >> > > > > > > >> >>>>>> tison.
>> >> > >> > > > > > > >> >>>>>>
>> >> > >> > > > > > > >> >>>>>>
>> >> > >> > > > > > > >> >>>>>> tison <[hidden email]> 于2019年12月12日周四
>> >> > >> 下午4:40写道:
>> >> > >> > > > > > > >> >>>>>>
>> >> > >> > > > > > > >> >>>>>>> Hi Peter,
>> >> > >> > > > > > > >> >>>>>>>
>> >> > >> > > > > > > >> >>>>>>> Another concern I realized recently is that with
>> >> > >> current
>> >> > >> > > > > > > Executors
>> >> > >> > > > > > > >> >>>>>>> abstraction(FLIP-73)
>> >> > >> > > > > > > >> >>>>>>> I'm afraid that user program is designed to ALWAYS
>> >> > >> run
>> >> > >> > > on
>> >> > >> > > > > the
>> >> > >> > > > > > > >> >>> client
>> >> > >> > > > > > > >> >>>>>> side.
>> >> > >> > > > > > > >> >>>>>>> Specifically,
>> >> > >> > > > > > > >> >>>>>>> we deploy the job in executor when env.execute
>> >> > >> called.
>> >> > >> > > > This
>> >> > >> > > > > > > >> >>>>> abstraction
>> >> > >> > > > > > > >> >>>>>>> possibly prevents
>> >> > >> > > > > > > >> >>>>>>> Flink runs user program on the cluster side.
>> >> > >> > > > > > > >> >>>>>>>
>> >> > >> > > > > > > >> >>>>>>> For your proposal, in this case we already
>> >> > >> compiled the
>> >> > >> > > > > > program
>> >> > >> > > > > > > >> and
>> >> > >> > > > > > > >> >>>>> run
>> >> > >> > > > > > > >> >>>>>> on
>> >> > >> > > > > > > >> >>>>>>> the client side,
>> >> > >> > > > > > > >> >>>>>>> even we deploy a cluster and retrieve job graph
>> >> > >> from
>> >> > >> > > > program
>> >> > >> > > > > > > >> >>>>> metadata, it
>> >> > >> > > > > > > >> >>>>>>> doesn't make
>> >> > >> > > > > > > >> >>>>>>> many sense.
>> >> > >> > > > > > > >> >>>>>>>
>> >> > >> > > > > > > >> >>>>>>> cc Aljoscha & Kostas what do you think about this
>> >> > >> > > > > constraint?
>> >> > >> > > > > > > >> >>>>>>>
>> >> > >> > > > > > > >> >>>>>>> Best,
>> >> > >> > > > > > > >> >>>>>>> tison.
>> >> > >> > > > > > > >> >>>>>>>
>> >> > >> > > > > > > >> >>>>>>>
>> >> > >> > > > > > > >> >>>>>>> Peter Huang <[hidden email]>
>> >> > >> 于2019年12月10日周二
>> >> > >> > > > > > > >> 下午12:45写道:
>> >> > >> > > > > > > >> >>>>>>>
>> >> > >> > > > > > > >> >>>>>>>> Hi Tison,
>> >> > >> > > > > > > >> >>>>>>>>
>> >> > >> > > > > > > >> >>>>>>>> Yes, you are right. I think I made the wrong
>> >> > >> argument
>> >> > >> > > in
>> >> > >> > > > > the
>> >> > >> > > > > > > doc.
>> >> > >> > > > > > > >> >>>>>>>> Basically, the packaging jar problem is only for
>> >> > >> > > platform
>> >> > >> > > > > > > users.
>> >> > >> > > > > > > >> >>> In
>> >> > >> > > > > > > >> >>>>> our
>> >> > >> > > > > > > >> >>>>>>>> internal deploy service,
>> >> > >> > > > > > > >> >>>>>>>> we further optimized the deployment latency by
>> >> > >> letting
>> >> > >> > > > > users
>> >> > >> > > > > > to
>> >> > >> > > > > > > >> >>>>>> packaging
>> >> > >> > > > > > > >> >>>>>>>> flink-runtime together with the uber jar, so that
>> >> > >> we
>> >> > >> > > > don't
>> >> > >> > > > > > need
>> >> > >> > > > > > > >> to
>> >> > >> > > > > > > >> >>>>>>>> consider
>> >> > >> > > > > > > >> >>>>>>>> multiple flink version
>> >> > >> > > > > > > >> >>>>>>>> support for now. In the session client mode, as
>> >> > >> Flink
>> >> > >> > > > libs
>> >> > >> > > > > > will
>> >> > >> > > > > > > >> be
>> >> > >> > > > > > > >> >>>>>> shipped
>> >> > >> > > > > > > >> >>>>>>>> anyway as local resources of yarn. Users actually
>> >> > >> don't
>> >> > >> > > > > need
>> >> > >> > > > > > to
>> >> > >> > > > > > > >> >>>>> package
>> >> > >> > > > > > > >> >>>>>>>> those libs into job jar.
>> >> > >> > > > > > > >> >>>>>>>>
>> >> > >> > > > > > > >> >>>>>>>>
>> >> > >> > > > > > > >> >>>>>>>>
>> >> > >> > > > > > > >> >>>>>>>> Best Regards
>> >> > >> > > > > > > >> >>>>>>>> Peter Huang
>> >> > >> > > > > > > >> >>>>>>>>
>> >> > >> > > > > > > >> >>>>>>>> On Mon, Dec 9, 2019 at 8:35 PM tison <
>> >> > >> > > > [hidden email]
>> >> > >> > > > > >
>> >> > >> > > > > > > >> >>> wrote:
>> >> > >> > > > > > > >> >>>>>>>>
>> >> > >> > > > > > > >> >>>>>>>>>> 3. What do you mean about the package? Do users
>> >> > >> need
>> >> > >> > > to
>> >> > >> > > > > > > >> >>> compile
>> >> > >> > > > > > > >> >>>>>> their
>> >> > >> > > > > > > >> >>>>>>>>> jars
>> >> > >> > > > > > > >> >>>>>>>>> inlcuding flink-clients, flink-optimizer,
>> >> > >> flink-table
>> >> > >> > > > > codes?
>> >> > >> > > > > > > >> >>>>>>>>>
>> >> > >> > > > > > > >> >>>>>>>>> The answer should be no because they exist in
>> >> > >> system
>> >> > >> > > > > > > classpath.
>> >> > >> > > > > > > >> >>>>>>>>>
>> >> > >> > > > > > > >> >>>>>>>>> Best,
>> >> > >> > > > > > > >> >>>>>>>>> tison.
>> >> > >> > > > > > > >> >>>>>>>>>
>> >> > >> > > > > > > >> >>>>>>>>>
>> >> > >> > > > > > > >> >>>>>>>>> Yang Wang <[hidden email]> 于2019年12月10日周二
>> >> > >> > > > > 下午12:18写道:
>> >> > >> > > > > > > >> >>>>>>>>>
>> >> > >> > > > > > > >> >>>>>>>>>> Hi Peter,
>> >> > >> > > > > > > >> >>>>>>>>>>
>> >> > >> > > > > > > >> >>>>>>>>>> Thanks a lot for starting this discussion. I
>> >> > >> think
>> >> > >> > > this
>> >> > >> > > > > is
>> >> > >> > > > > > a
>> >> > >> > > > > > > >> >>> very
>> >> > >> > > > > > > >> >>>>>>>> useful
>> >> > >> > > > > > > >> >>>>>>>>>> feature.
>> >> > >> > > > > > > >> >>>>>>>>>>
>> >> > >> > > > > > > >> >>>>>>>>>> Not only for Yarn, i am focused on flink on
>> >> > >> > > Kubernetes
>> >> > >> > > > > > > >> >>>>> integration
>> >> > >> > > > > > > >> >>>>>> and
>> >> > >> > > > > > > >> >>>>>>>>> come
>> >> > >> > > > > > > >> >>>>>>>>>> across the same
>> >> > >> > > > > > > >> >>>>>>>>>> problem. I do not want the job graph generated
>> >> > >> on
>> >> > >> > > > client
>> >> > >> > > > > > > side.
>> >> > >> > > > > > > >> >>>>>>>> Instead,
>> >> > >> > > > > > > >> >>>>>>>>> the
>> >> > >> > > > > > > >> >>>>>>>>>> user jars are built in
>> >> > >> > > > > > > >> >>>>>>>>>> a user-defined image. When the job manager
>> >> > >> launched,
>> >> > >> > > we
>> >> > >> > > > > > just
>> >> > >> > > > > > > >> >>>>> need to
>> >> > >> > > > > > > >> >>>>>>>>>> generate the job graph
>> >> > >> > > > > > > >> >>>>>>>>>> based on local user jars.
>> >> > >> > > > > > > >> >>>>>>>>>>
>> >> > >> > > > > > > >> >>>>>>>>>> I have some small suggestion about this.
>> >> > >> > > > > > > >> >>>>>>>>>>
>> >> > >> > > > > > > >> >>>>>>>>>> 1. `ProgramJobGraphRetriever` is very similar to
>> >> > >> > > > > > > >> >>>>>>>>>> `ClasspathJobGraphRetriever`, the differences
>> >> > >> > > > > > > >> >>>>>>>>>> are the former needs `ProgramMetadata` and the
>> >> > >> latter
>> >> > >> > > > > needs
>> >> > >> > > > > > > >> >>> some
>> >> > >> > > > > > > >> >>>>>>>>> arguments.
>> >> > >> > > > > > > >> >>>>>>>>>> Is it possible to
>> >> > >> > > > > > > >> >>>>>>>>>> have an unified `JobGraphRetriever` to support
>> >> > >> both?
>> >> > >> > > > > > > >> >>>>>>>>>> 2. Is it possible to not use a local user jar to
>> >> > >> > > start
>> >> > >> > > > a
>> >> > >> > > > > > > >> >>> per-job
>> >> > >> > > > > > > >> >>>>>>>> cluster?
>> >> > >> > > > > > > >> >>>>>>>>>> In your case, the user jars has
>> >> > >> > > > > > > >> >>>>>>>>>> existed on hdfs already and we do need to
>> >> > >> download
>> >> > >> > > the
>> >> > >> > > > > jars
>> >> > >> > > > > > > to
>> >> > >> > > > > > > >> >>>>>>>> deployer
>> >> > >> > > > > > > >> >>>>>>>>>> service. Currently, we
>> >> > >> > > > > > > >> >>>>>>>>>> always need a local user jar to start a flink
>> >> > >> > > cluster.
>> >> > >> > > > It
>> >> > >> > > > > > is
>> >> > >> > > > > > > >> >>> be
>> >> > >> > > > > > > >> >>>>>> great
>> >> > >> > > > > > > >> >>>>>>>> if
>> >> > >> > > > > > > >> >>>>>>>>> we
>> >> > >> > > > > > > >> >>>>>>>>>> could support remote user jars.
>> >> > >> > > > > > > >> >>>>>>>>>>>> In the implementation, we assume users package
>> >> > >> > > > > > > >> >>> flink-clients,
>> >> > >> > > > > > > >> >>>>>>>>>> flink-optimizer, flink-table together within
>> >> > >> the job
>> >> > >> > > > jar.
>> >> > >> > > > > > > >> >>>>> Otherwise,
>> >> > >> > > > > > > >> >>>>>>>> the
>> >> > >> > > > > > > >> >>>>>>>>>> job graph generation within
>> >> > >> JobClusterEntryPoint will
>> >> > >> > > > > fail.
>> >> > >> > > > > > > >> >>>>>>>>>> 3. What do you mean about the package? Do users
>> >> > >> need
>> >> > >> > > to
>> >> > >> > > > > > > >> >>> compile
>> >> > >> > > > > > > >> >>>>>> their
>> >> > >> > > > > > > >> >>>>>>>>> jars
>> >> > >> > > > > > > >> >>>>>>>>>> inlcuding flink-clients, flink-optimizer,
>> >> > >> flink-table
>> >> > >> > > > > > codes?
>> >> > >> > > > > > > >> >>>>>>>>>>
>> >> > >> > > > > > > >> >>>>>>>>>>
>> >> > >> > > > > > > >> >>>>>>>>>>
>> >> > >> > > > > > > >> >>>>>>>>>> Best,
>> >> > >> > > > > > > >> >>>>>>>>>> Yang
>> >> > >> > > > > > > >> >>>>>>>>>>
>> >> > >> > > > > > > >> >>>>>>>>>> Peter Huang <[hidden email]>
>> >> > >> > > > 于2019年12月10日周二
>> >> > >> > > > > > > >> >>>>> 上午2:37写道:
>> >> > >> > > > > > > >> >>>>>>>>>>
>> >> > >> > > > > > > >> >>>>>>>>>>> Dear All,
>> >> > >> > > > > > > >> >>>>>>>>>>>
>> >> > >> > > > > > > >> >>>>>>>>>>> Recently, the Flink community starts to
>> >> > >> improve the
>> >> > >> > > > yarn
>> >> > >> > > > > > > >> >>>>> cluster
>> >> > >> > > > > > > >> >>>>>>>>>> descriptor
>> >> > >> > > > > > > >> >>>>>>>>>>> to make job jar and config files configurable
>> >> > >> from
>> >> > >> > > > CLI.
>> >> > >> > > > > It
>> >> > >> > > > > > > >> >>>>>> improves
>> >> > >> > > > > > > >> >>>>>>>> the
>> >> > >> > > > > > > >> >>>>>>>>>>> flexibility of  Flink deployment Yarn Per Job
>> >> > >> Mode.
>> >> > >> > > > For
>> >> > >> > > > > > > >> >>>>> platform
>> >> > >> > > > > > > >> >>>>>>>> users
>> >> > >> > > > > > > >> >>>>>>>>>> who
>> >> > >> > > > > > > >> >>>>>>>>>>> manage tens of hundreds of streaming pipelines
>> >> > >> for
>> >> > >> > > the
>> >> > >> > > > > > whole
>> >> > >> > > > > > > >> >>>>> org
>> >> > >> > > > > > > >> >>>>>> or
>> >> > >> > > > > > > >> >>>>>>>>>>> company, we found the job graph generation in
>> >> > >> > > > > client-side
>> >> > >> > > > > > is
>> >> > >> > > > > > > >> >>>>>> another
>> >> > >> > > > > > > >> >>>>>>>>>>> pinpoint. Thus, we want to propose a
>> >> > >> configurable
>> >> > >> > > > > feature
>> >> > >> > > > > > > >> >>> for
>> >> > >> > > > > > > >> >>>>>>>>>>> FlinkYarnSessionCli. The feature can allow
>> >> > >> users to
>> >> > >> > > > > choose
>> >> > >> > > > > > > >> >>> the
>> >> > >> > > > > > > >> >>>>> job
>> >> > >> > > > > > > >> >>>>>>>>> graph
>> >> > >> > > > > > > >> >>>>>>>>>>> generation in Flink ClusterEntryPoint so that
>> >> > >> the
>> >> > >> > > job
>> >> > >> > > > > jar
>> >> > >> > > > > > > >> >>>>> doesn't
>> >> > >> > > > > > > >> >>>>>>>> need
>> >> > >> > > > > > > >> >>>>>>>>> to
>> >> > >> > > > > > > >> >>>>>>>>>>> be locally for the job graph generation. The
>> >> > >> > > proposal
>> >> > >> > > > is
>> >> > >> > > > > > > >> >>>>> organized
>> >> > >> > > > > > > >> >>>>>>>> as a
>> >> > >> > > > > > > >> >>>>>>>>>>> FLIP
>> >> > >> > > > > > > >> >>>>>>>>>>>
>> >> > >> > > > > > > >> >>>>>>>>>>>
>> >> > >> > > > > > > >> >>>>>>>>>>
>> >> > >> > > > > > > >> >>>>>>>>>
>> >> > >> > > > > > > >> >>>>>>>>
>> >> > >> > > > > > > >> >>>>>>
>> >> > >> > > > > > > >> >>>>>
>> >> > >> > > > > > > >> >>>
>> >> > >> > > > > > > >>
>> >> > >> > > > > > >
>> >> > >> > > > > >
>> >> > >> > > > >
>> >> > >> > > >
>> >> > >> > >
>> >> > >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-85+Delayed+JobGraph+Generation
>> >> > >> > > > > > > >> >>>>>>>>>>> .
>> >> > >> > > > > > > >> >>>>>>>>>>>
>> >> > >> > > > > > > >> >>>>>>>>>>> Any questions and suggestions are welcomed.
>> >> > >> Thank
>> >> > >> > > you
>> >> > >> > > > in
>> >> > >> > > > > > > >> >>>>> advance.
>> >> > >> > > > > > > >> >>>>>>>>>>>
>> >> > >> > > > > > > >> >>>>>>>>>>>
>> >> > >> > > > > > > >> >>>>>>>>>>> Best Regards
>> >> > >> > > > > > > >> >>>>>>>>>>> Peter Huang
>> >> > >> > > > > > > >> >>>>>>>>>>>
>> >> > >> > > > > > > >> >>>>>>>>>>
>> >> > >> > > > > > > >> >>>>>>>>>
>> >> > >> > > > > > > >> >>>>>>>>
>> >> > >> > > > > > > >> >>>>>>>
>> >> > >> > > > > > > >> >>>>>>
>> >> > >> > > > > > > >> >>>>>
>> >> > >> > > > > > > >> >>>>
>> >> > >> > > > > > > >> >>>
>> >> > >> > > > > > > >> >>
>> >> > >> > > > > > > >>
>> >> > >> > > > > > > >>
>> >> > >> > > > > > >
>> >> > >> > > > > >
>> >> > >> > > > >
>> >> > >> > > >
>> >> > >> > >
>> >> > >>
>> >> > >
>> >>

Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-85: Delayed Job Graph Generation

Yang Wang
Hi Peter,

Having the application mode does not mean we will drop the cluster-deploy
option. I just want to share some thoughts about “Application Mode”.


1. The application mode could cover the per-job sematic. Its lifecyle is
bound
to the user `main()`. And all the jobs in the user main will be executed in
a same
Flink cluster. In first phase of FLIP-85 implementation, running user main
on the
cluster side could be supported in application mode.

2. Maybe in the future, we also need to support multiple `execute()` on
client side
in a same Flink cluster. Then the per-job mode will evolve to application
mode.

3. From user perspective, only a `-R/-- remote-deploy` cli option is
visible. They
are not aware of the application mode.

4. In the first phase, the application mode is working as “per-job”(only
one job in
the user main). We just leave more potential for the future.


I am not against with calling it “cluster deploy mode” if you all think it
is clearer for users.



Best,
Yang

Kostas Kloudas <[hidden email]> 于2020年3月3日周二 下午6:49写道:

> Hi Peter,
>
> I understand your point. This is why I was also a bit torn about the
> name and my proposal was a bit aligned with yours (something along the
> lines of "cluster deploy" mode).
>
> But many of the other participants in the discussion suggested the
> "Application Mode". I think that the reasoning is that now the user's
> Application is more self-contained.
> It will be submitted to the cluster and the user can just disconnect.
> In addition, as discussed briefly in the doc, in the future there may
> be better support for multi-execute applications which will bring us
> one step closer to the true "Application Mode". But this is how I
> interpreted their arguments, of course they can also express their
> thoughts on the topic :)
>
> Cheers,
> Kostas
>
> On Mon, Mar 2, 2020 at 6:15 PM Peter Huang <[hidden email]>
> wrote:
> >
> > Hi Kostas,
> >
> > Thanks for updating the wiki. We have aligned with the implementations
> in the doc. But I feel it is still a little bit confusing of the naming
> from a user's perspective. It is well known that Flink support per job
> cluster and session cluster. The concept is in the layer of how a job is
> managed within Flink. The method introduced util now is a kind of mixing
> job and session cluster to promising the implementation complexity. We
> probably don't need to label it as Application Model as the same layer of
> per job cluster and session cluster. Conceptually, I think it is still a
> cluster mode implementation for per job cluster.
> >
> > To minimize the confusion of users, I think it would be better just an
> option of per job cluster for each type of cluster manager. How do you
> think?
> >
> >
> > Best Regards
> > Peter Huang
> >
> >
> >
> >
> >
> >
> >
> >
> > On Mon, Mar 2, 2020 at 7:22 AM Kostas Kloudas <[hidden email]>
> wrote:
> >>
> >> Hi Yang,
> >>
> >> The difference between per-job and application mode is that, as you
> >> described, in the per-job mode the main is executed on the client
> >> while in the application mode, the main is executed on the cluster.
> >> I do not think we have to offer "application mode" with running the
> >> main on the client side as this is exactly what the per-job mode does
> >> currently and, as you described also, it would be redundant.
> >>
> >> Sorry if this was not clear in the document.
> >>
> >> Cheers,
> >> Kostas
> >>
> >> On Mon, Mar 2, 2020 at 3:17 PM Yang Wang <[hidden email]> wrote:
> >> >
> >> > Hi Kostas,
> >> >
> >> > Thanks a lot for your conclusion and updating the FLIP-85 WIKI.
> Currently, i have no more
> >> > questions about motivation, approach, fault tolerance and the first
> phase implementation.
> >> >
> >> > I think the new title "Flink Application Mode" makes a lot senses to
> me. Especially for the
> >> > containerized environment, the cluster deploy option will be very
> useful.
> >> >
> >> > Just one concern, how do we introduce this new application mode to
> our users?
> >> > Each user program(i.e. `main()`) is an application. Currently, we
> intend to only support one
> >> > `execute()`. So what's the difference between per-job and application
> mode?
> >> >
> >> > For per-job, user `main()` is always executed on client side. And For
> application mode, user
> >> > `main()` could be executed on client or master side(configured via
> cli option).
> >> > Right? We need to have a clear concept. Otherwise, the users will be
> more and more confusing.
> >> >
> >> >
> >> > Best,
> >> > Yang
> >> >
> >> > Kostas Kloudas <[hidden email]> 于2020年3月2日周一 下午5:58写道:
> >> >>
> >> >> Hi all,
> >> >>
> >> >> I update
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-85+Flink+Application+Mode
> >> >> based on the discussion we had here:
> >> >>
> >> >>
> https://docs.google.com/document/d/1ji72s3FD9DYUyGuKnJoO4ApzV-nSsZa0-bceGXW7Ocw/edit#
> >> >>
> >> >> Please let me know what you think and please keep the discussion in
> the ML :)
> >> >>
> >> >> Thanks for starting the discussion and I hope that soon we will be
> >> >> able to vote on the FLIP.
> >> >>
> >> >> Cheers,
> >> >> Kostas
> >> >>
> >> >> On Thu, Jan 16, 2020 at 3:40 AM Yang Wang <[hidden email]>
> wrote:
> >> >> >
> >> >> > Hi all,
> >> >> >
> >> >> > Thanks a lot for the feedback from @Kostas Kloudas. Your all
> concerns are
> >> >> > on point. The FLIP-85 is mainly
> >> >> > focused on supporting cluster mode for per-job. Since it is more
> urgent and
> >> >> > have much more use
> >> >> > cases both in Yarn and Kubernetes deployment. For session cluster,
> we could
> >> >> > have more discussion
> >> >> > in a new thread later.
> >> >> >
> >> >> > #1, How to download the user jars and dependencies for per-job in
> cluster
> >> >> > mode?
> >> >> > For Yarn, we could register the user jars and dependencies as
> >> >> > LocalResource. They will be distributed
> >> >> > by Yarn. And once the JobManager and TaskManager launched, the
> jars are
> >> >> > already exists.
> >> >> > For Standalone per-job and K8s, we expect that the user jars
> >> >> > and dependencies are built into the image.
> >> >> > Or the InitContainer could be used for downloading. It is natively
> >> >> > distributed and we will not have bottleneck.
> >> >> >
> >> >> > #2, Job graph recovery
> >> >> > We could have an optimization to store job graph on the DFS.
> However, i
> >> >> > suggest building a new jobgraph
> >> >> > from the configuration is the default option. Since we will not
> always have
> >> >> > a DFS store when deploying a
> >> >> > Flink per-job cluster. Of course, we assume that using the same
> >> >> > configuration(e.g. job_id, user_jar, main_class,
> >> >> > main_args, parallelism, savepoint_settings, etc.) will get a same
> job
> >> >> > graph. I think the standalone per-job
> >> >> > already has the similar behavior.
> >> >> >
> >> >> > #3, What happens with jobs that have multiple execute calls?
> >> >> > Currently, it is really a problem. Even we use a local client on
> Flink
> >> >> > master side, it will have different behavior with
> >> >> > client mode. For client mode, if we execute multiple times, then
> we will
> >> >> > deploy multiple Flink clusters for each execute.
> >> >> > I am not pretty sure whether it is reasonable. However, i still
> think using
> >> >> > the local client is a good choice. We could
> >> >> > continue the discussion in a new thread. @Zili Chen <
> [hidden email]> Do
> >> >> > you want to drive this?
> >> >> >
> >> >> >
> >> >> >
> >> >> > Best,
> >> >> > Yang
> >> >> >
> >> >> > Peter Huang <[hidden email]> 于2020年1月16日周四 上午1:55写道:
> >> >> >
> >> >> > > Hi Kostas,
> >> >> > >
> >> >> > > Thanks for this feedback. I can't agree more about the opinion.
> The
> >> >> > > cluster mode should be added
> >> >> > > first in per job cluster.
> >> >> > >
> >> >> > > 1) For job cluster implementation
> >> >> > > 1. Job graph recovery from configuration or store as static job
> graph as
> >> >> > > session cluster. I think the static one will be better for less
> recovery
> >> >> > > time.
> >> >> > > Let me update the doc for details.
> >> >> > >
> >> >> > > 2. For job execute multiple times, I think @Zili Chen
> >> >> > > <[hidden email]> has proposed the local client solution
> that can
> >> >> > > the run program actually in the cluster entry point. We can put
> the
> >> >> > > implementation in the second stage,
> >> >> > > or even a new FLIP for further discussion.
> >> >> > >
> >> >> > > 2) For session cluster implementation
> >> >> > > We can disable the cluster mode for the session cluster in the
> first
> >> >> > > stage. I agree the jar downloading will be a painful thing.
> >> >> > > We can consider about PoC and performance evaluation first. If
> the end to
> >> >> > > end experience is good enough, then we can consider
> >> >> > > proceeding with the solution.
> >> >> > >
> >> >> > > Looking forward to more opinions from @Yang Wang <
> [hidden email]> @Zili
> >> >> > > Chen <[hidden email]> @Dian Fu <[hidden email]>.
> >> >> > >
> >> >> > >
> >> >> > > Best Regards
> >> >> > > Peter Huang
> >> >> > >
> >> >> > > On Wed, Jan 15, 2020 at 7:50 AM Kostas Kloudas <
> [hidden email]> wrote:
> >> >> > >
> >> >> > >> Hi all,
> >> >> > >>
> >> >> > >> I am writing here as the discussion on the Google Doc seems to
> be a
> >> >> > >> bit difficult to follow.
> >> >> > >>
> >> >> > >> I think that in order to be able to make progress, it would be
> helpful
> >> >> > >> to focus on per-job mode for now.
> >> >> > >> The reason is that:
> >> >> > >>  1) making the (unique) JobSubmitHandler responsible for
> creating the
> >> >> > >> jobgraphs,
> >> >> > >>   which includes downloading dependencies, is not an optimal
> solution
> >> >> > >>  2) even if we put the responsibility on the JobMaster,
> currently each
> >> >> > >> job has its own
> >> >> > >>   JobMaster but they all run on the same process, so we have
> again a
> >> >> > >> single entity.
> >> >> > >>
> >> >> > >> Of course after this is done, and if we feel comfortable with
> the
> >> >> > >> solution, then we can go to the session mode.
> >> >> > >>
> >> >> > >> A second comment has to do with fault-tolerance in the per-job,
> >> >> > >> cluster-deploy mode.
> >> >> > >> In the document, it is suggested that upon recovery, the
> JobMaster of
> >> >> > >> each job re-creates the JobGraph.
> >> >> > >> I am just wondering if it is better to create and store the
> jobGraph
> >> >> > >> upon submission and only fetch it
> >> >> > >> upon recovery so that we have a static jobGraph.
> >> >> > >>
> >> >> > >> Finally, I have a question which is what happens with jobs that
> have
> >> >> > >> multiple execute calls?
> >> >> > >> The semantics seem to change compared to the current behaviour,
> right?
> >> >> > >>
> >> >> > >> Cheers,
> >> >> > >> Kostas
> >> >> > >>
> >> >> > >> On Wed, Jan 8, 2020 at 8:05 PM tison <[hidden email]>
> wrote:
> >> >> > >> >
> >> >> > >> > not always, Yang Wang is also not yet a committer but he can
> join the
> >> >> > >> > channel. I cannot find the id by clicking “Add new member in
> channel” so
> >> >> > >> > come to you and ask for try out the link. Possibly I will
> find other
> >> >> > >> ways
> >> >> > >> > but the original purpose is that the slack channel is a
> public area we
> >> >> > >> > discuss about developing...
> >> >> > >> > Best,
> >> >> > >> > tison.
> >> >> > >> >
> >> >> > >> >
> >> >> > >> > Peter Huang <[hidden email]> 于2020年1月9日周四
> 上午2:44写道:
> >> >> > >> >
> >> >> > >> > > Hi Tison,
> >> >> > >> > >
> >> >> > >> > > I am not the committer of Flink yet. I think I can't join
> it also.
> >> >> > >> > >
> >> >> > >> > >
> >> >> > >> > > Best Regards
> >> >> > >> > > Peter Huang
> >> >> > >> > >
> >> >> > >> > > On Wed, Jan 8, 2020 at 9:39 AM tison <[hidden email]>
> wrote:
> >> >> > >> > >
> >> >> > >> > > > Hi Peter,
> >> >> > >> > > >
> >> >> > >> > > > Could you try out this link?
> >> >> > >> > > https://the-asf.slack.com/messages/CNA3ADZPH
> >> >> > >> > > >
> >> >> > >> > > > Best,
> >> >> > >> > > > tison.
> >> >> > >> > > >
> >> >> > >> > > >
> >> >> > >> > > > Peter Huang <[hidden email]> 于2020年1月9日周四
> 上午1:22写道:
> >> >> > >> > > >
> >> >> > >> > > > > Hi Tison,
> >> >> > >> > > > >
> >> >> > >> > > > > I can't join the group with shared link. Would you
> please add me
> >> >> > >> into
> >> >> > >> > > the
> >> >> > >> > > > > group? My slack account is huangzhenqiu0825.
> >> >> > >> > > > > Thank you in advance.
> >> >> > >> > > > >
> >> >> > >> > > > >
> >> >> > >> > > > > Best Regards
> >> >> > >> > > > > Peter Huang
> >> >> > >> > > > >
> >> >> > >> > > > > On Wed, Jan 8, 2020 at 12:02 AM tison <
> [hidden email]>
> >> >> > >> wrote:
> >> >> > >> > > > >
> >> >> > >> > > > > > Hi Peter,
> >> >> > >> > > > > >
> >> >> > >> > > > > > As described above, this effort should get attention
> from people
> >> >> > >> > > > > developing
> >> >> > >> > > > > > FLIP-73 a.k.a. Executor abstractions. I recommend you
> to join
> >> >> > >> the
> >> >> > >> > > > public
> >> >> > >> > > > > > slack channel[1] for Flink Client API Enhancement and
> you can
> >> >> > >> try to
> >> >> > >> > > > > share
> >> >> > >> > > > > > you detailed thoughts there. It possibly gets more
> concrete
> >> >> > >> > > attentions.
> >> >> > >> > > > > >
> >> >> > >> > > > > > Best,
> >> >> > >> > > > > > tison.
> >> >> > >> > > > > >
> >> >> > >> > > > > > [1]
> >> >> > >> > > > > >
> >> >> > >> > > > > >
> >> >> > >> > > > >
> >> >> > >> > > >
> >> >> > >> > >
> >> >> > >>
> https://slack.com/share/IS21SJ75H/Rk8HhUly9FuEHb7oGwBZ33uL/enQtODg2MDYwNjE5MTg3LTA2MjIzNDc1M2ZjZDVlMjdlZjk1M2RkYmJhNjAwMTk2ZDZkODQ4NmY5YmI4OGRhNWJkYTViMTM1NzlmMzc4OWM
> >> >> > >> > > > > >
> >> >> > >> > > > > >
> >> >> > >> > > > > > Peter Huang <[hidden email]>
> 于2020年1月7日周二 上午5:09写道:
> >> >> > >> > > > > >
> >> >> > >> > > > > > > Dear All,
> >> >> > >> > > > > > >
> >> >> > >> > > > > > > Happy new year! According to existing feedback from
> the
> >> >> > >> community,
> >> >> > >> > > we
> >> >> > >> > > > > > > revised the doc with the consideration of session
> cluster
> >> >> > >> support,
> >> >> > >> > > > and
> >> >> > >> > > > > > > concrete interface changes needed and execution
> plan. Please
> >> >> > >> take
> >> >> > >> > > one
> >> >> > >> > > > > > more
> >> >> > >> > > > > > > round of review at your most convenient time.
> >> >> > >> > > > > > >
> >> >> > >> > > > > > >
> >> >> > >> > > > > > >
> >> >> > >> > > > > >
> >> >> > >> > > > >
> >> >> > >> > > >
> >> >> > >> > >
> >> >> > >>
> https://docs.google.com/document/d/1aAwVjdZByA-0CHbgv16Me-vjaaDMCfhX7TzVVTuifYM/edit#
> >> >> > >> > > > > > >
> >> >> > >> > > > > > >
> >> >> > >> > > > > > > Best Regards
> >> >> > >> > > > > > > Peter Huang
> >> >> > >> > > > > > >
> >> >> > >> > > > > > >
> >> >> > >> > > > > > >
> >> >> > >> > > > > > >
> >> >> > >> > > > > > >
> >> >> > >> > > > > > > On Thu, Jan 2, 2020 at 11:29 AM Peter Huang <
> >> >> > >> > > > > [hidden email]>
> >> >> > >> > > > > > > wrote:
> >> >> > >> > > > > > >
> >> >> > >> > > > > > > > Hi Dian,
> >> >> > >> > > > > > > > Thanks for giving us valuable feedbacks.
> >> >> > >> > > > > > > >
> >> >> > >> > > > > > > > 1) It's better to have a whole design for this
> feature
> >> >> > >> > > > > > > > For the suggestion of enabling the cluster mode
> also session
> >> >> > >> > > > > cluster, I
> >> >> > >> > > > > > > > think Flink already supported it.
> WebSubmissionExtension
> >> >> > >> already
> >> >> > >> > > > > allows
> >> >> > >> > > > > > > > users to start a job with the specified jar by
> using web UI.
> >> >> > >> > > > > > > > But we need to enable the feature from CLI for
> both local
> >> >> > >> jar,
> >> >> > >> > > > remote
> >> >> > >> > > > > > > jar.
> >> >> > >> > > > > > > > I will align with Yang Wang first about the
> details and
> >> >> > >> update
> >> >> > >> > > the
> >> >> > >> > > > > > design
> >> >> > >> > > > > > > > doc.
> >> >> > >> > > > > > > >
> >> >> > >> > > > > > > > 2) It's better to consider the convenience for
> users, such
> >> >> > >> as
> >> >> > >> > > > > debugging
> >> >> > >> > > > > > > >
> >> >> > >> > > > > > > > I am wondering whether we can store the exception
> in
> >> >> > >> jobgragh
> >> >> > >> > > > > > > > generation in application master. As no streaming
> graph can
> >> >> > >> be
> >> >> > >> > > > > > scheduled
> >> >> > >> > > > > > > in
> >> >> > >> > > > > > > > this case, there will be no more TM will be
> requested from
> >> >> > >> > > FlinkRM.
> >> >> > >> > > > > > > > If the AM is still running, users can still query
> it from
> >> >> > >> CLI. As
> >> >> > >> > > > it
> >> >> > >> > > > > > > > requires more change, we can get some feedback
> from <
> >> >> > >> > > > > > [hidden email]
> >> >> > >> > > > > > > >
> >> >> > >> > > > > > > > and @[hidden email] <[hidden email]>.
> >> >> > >> > > > > > > >
> >> >> > >> > > > > > > > 3) It's better to consider the impact to the
> stability of
> >> >> > >> the
> >> >> > >> > > > cluster
> >> >> > >> > > > > > > >
> >> >> > >> > > > > > > > I agree with Yang Wang's opinion.
> >> >> > >> > > > > > > >
> >> >> > >> > > > > > > >
> >> >> > >> > > > > > > >
> >> >> > >> > > > > > > > Best Regards
> >> >> > >> > > > > > > > Peter Huang
> >> >> > >> > > > > > > >
> >> >> > >> > > > > > > >
> >> >> > >> > > > > > > > On Sun, Dec 29, 2019 at 9:44 PM Dian Fu <
> >> >> > >> [hidden email]>
> >> >> > >> > > > > wrote:
> >> >> > >> > > > > > > >
> >> >> > >> > > > > > > >> Hi all,
> >> >> > >> > > > > > > >>
> >> >> > >> > > > > > > >> Sorry to jump into this discussion. Thanks
> everyone for the
> >> >> > >> > > > > > discussion.
> >> >> > >> > > > > > > >> I'm very interested in this topic although I'm
> not an
> >> >> > >> expert in
> >> >> > >> > > > this
> >> >> > >> > > > > > > part.
> >> >> > >> > > > > > > >> So I'm glad to share my thoughts as following:
> >> >> > >> > > > > > > >>
> >> >> > >> > > > > > > >> 1) It's better to have a whole design for this
> feature
> >> >> > >> > > > > > > >> As we know, there are two deployment modes:
> per-job mode
> >> >> > >> and
> >> >> > >> > > > session
> >> >> > >> > > > > > > >> mode. I'm wondering which mode really needs this
> feature.
> >> >> > >> As the
> >> >> > >> > > > > > design
> >> >> > >> > > > > > > doc
> >> >> > >> > > > > > > >> mentioned, per-job mode is more used for
> streaming jobs and
> >> >> > >> > > > session
> >> >> > >> > > > > > > mode is
> >> >> > >> > > > > > > >> usually used for batch jobs(Of course, the job
> types and
> >> >> > >> the
> >> >> > >> > > > > > deployment
> >> >> > >> > > > > > > >> modes are orthogonal). Usually streaming job is
> only
> >> >> > >> needed to
> >> >> > >> > > be
> >> >> > >> > > > > > > submitted
> >> >> > >> > > > > > > >> once and it will run for days or weeks, while
> batch jobs
> >> >> > >> will be
> >> >> > >> > > > > > > submitted
> >> >> > >> > > > > > > >> more frequently compared with streaming jobs.
> This means
> >> >> > >> that
> >> >> > >> > > > maybe
> >> >> > >> > > > > > > session
> >> >> > >> > > > > > > >> mode also needs this feature. However, if we
> support this
> >> >> > >> > > feature
> >> >> > >> > > > in
> >> >> > >> > > > > > > >> session mode, the application master will become
> the new
> >> >> > >> > > > centralized
> >> >> > >> > > > > > > >> service(which should be solved). So in this
> case, it's
> >> >> > >> better to
> >> >> > >> > > > > have
> >> >> > >> > > > > > a
> >> >> > >> > > > > > > >> complete design for both per-job mode and
> session mode.
> >> >> > >> > > > Furthermore,
> >> >> > >> > > > > > > even
> >> >> > >> > > > > > > >> if we can do it phase by phase, we need to have
> a whole
> >> >> > >> picture
> >> >> > >> > > of
> >> >> > >> > > > > how
> >> >> > >> > > > > > > it
> >> >> > >> > > > > > > >> works in both per-job mode and session mode.
> >> >> > >> > > > > > > >>
> >> >> > >> > > > > > > >> 2) It's better to consider the convenience for
> users, such
> >> >> > >> as
> >> >> > >> > > > > > debugging
> >> >> > >> > > > > > > >> After we finish this feature, the job graph will
> be
> >> >> > >> compiled in
> >> >> > >> > > > the
> >> >> > >> > > > > > > >> application master, which means that users
> cannot easily
> >> >> > >> get the
> >> >> > >> > > > > > > exception
> >> >> > >> > > > > > > >> message synchorousely in the job client if there
> are
> >> >> > >> problems
> >> >> > >> > > > during
> >> >> > >> > > > > > the
> >> >> > >> > > > > > > >> job graph compiling (especially for platform
> users), such
> >> >> > >> as the
> >> >> > >> > > > > > > resource
> >> >> > >> > > > > > > >> path is incorrect, the user program itself has
> some
> >> >> > >> problems,
> >> >> > >> > > etc.
> >> >> > >> > > > > > What
> >> >> > >> > > > > > > I'm
> >> >> > >> > > > > > > >> thinking is that maybe we should throw the
> exceptions as
> >> >> > >> early
> >> >> > >> > > as
> >> >> > >> > > > > > > possible
> >> >> > >> > > > > > > >> (during job submission stage).
> >> >> > >> > > > > > > >>
> >> >> > >> > > > > > > >> 3) It's better to consider the impact to the
> stability of
> >> >> > >> the
> >> >> > >> > > > > cluster
> >> >> > >> > > > > > > >> If we perform the compiling in the application
> master, we
> >> >> > >> should
> >> >> > >> > > > > > > consider
> >> >> > >> > > > > > > >> the impact of the compiling errors. Although
> YARN could
> >> >> > >> resume
> >> >> > >> > > the
> >> >> > >> > > > > > > >> application master in case of failures, but in
> some case
> >> >> > >> the
> >> >> > >> > > > > compiling
> >> >> > >> > > > > > > >> failure may be a waste of cluster resource and
> may impact
> >> >> > >> the
> >> >> > >> > > > > > stability
> >> >> > >> > > > > > > the
> >> >> > >> > > > > > > >> cluster and the other jobs in the cluster, such
> as the
> >> >> > >> resource
> >> >> > >> > > > path
> >> >> > >> > > > > > is
> >> >> > >> > > > > > > >> incorrect, the user program itself has some
> problems(in
> >> >> > >> this
> >> >> > >> > > case,
> >> >> > >> > > > > job
> >> >> > >> > > > > > > >> failover cannot solve this kind of problems)
> etc. In the
> >> >> > >> current
> >> >> > >> > > > > > > >> implemention, the compiling errors are handled
> in the
> >> >> > >> client
> >> >> > >> > > side
> >> >> > >> > > > > and
> >> >> > >> > > > > > > there
> >> >> > >> > > > > > > >> is no impact to the cluster at all.
> >> >> > >> > > > > > > >>
> >> >> > >> > > > > > > >> Regarding to 1), it's clearly pointed in the
> design doc
> >> >> > >> that
> >> >> > >> > > only
> >> >> > >> > > > > > > per-job
> >> >> > >> > > > > > > >> mode will be supported. However, I think it's
> better to
> >> >> > >> also
> >> >> > >> > > > > consider
> >> >> > >> > > > > > > the
> >> >> > >> > > > > > > >> session mode in the design doc.
> >> >> > >> > > > > > > >> Regarding to 2) and 3), I have not seen related
> sections
> >> >> > >> in the
> >> >> > >> > > > > design
> >> >> > >> > > > > > > >> doc. It will be good if we can cover them in the
> design
> >> >> > >> doc.
> >> >> > >> > > > > > > >>
> >> >> > >> > > > > > > >> Feel free to correct me If there is anything I
> >> >> > >> misunderstand.
> >> >> > >> > > > > > > >>
> >> >> > >> > > > > > > >> Regards,
> >> >> > >> > > > > > > >> Dian
> >> >> > >> > > > > > > >>
> >> >> > >> > > > > > > >>
> >> >> > >> > > > > > > >> > 在 2019年12月27日,上午3:13,Peter Huang <
> >> >> > >> [hidden email]>
> >> >> > >> > > > 写道:
> >> >> > >> > > > > > > >> >
> >> >> > >> > > > > > > >> > Hi Yang,
> >> >> > >> > > > > > > >> >
> >> >> > >> > > > > > > >> > I can't agree more. The effort definitely
> needs to align
> >> >> > >> with
> >> >> > >> > > > the
> >> >> > >> > > > > > > final
> >> >> > >> > > > > > > >> > goal of FLIP-73.
> >> >> > >> > > > > > > >> > I am thinking about whether we can achieve the
> goal with
> >> >> > >> two
> >> >> > >> > > > > phases.
> >> >> > >> > > > > > > >> >
> >> >> > >> > > > > > > >> > 1) Phase I
> >> >> > >> > > > > > > >> > As the CLiFrontend will not be depreciated
> soon. We can
> >> >> > >> still
> >> >> > >> > > > use
> >> >> > >> > > > > > the
> >> >> > >> > > > > > > >> > deployMode flag there,
> >> >> > >> > > > > > > >> > pass the program info through Flink
> configuration,  use
> >> >> > >> the
> >> >> > >> > > > > > > >> > ClassPathJobGraphRetriever
> >> >> > >> > > > > > > >> > to generate the job graph in
> ClusterEntrypoints of yarn
> >> >> > >> and
> >> >> > >> > > > > > > Kubernetes.
> >> >> > >> > > > > > > >> >
> >> >> > >> > > > > > > >> > 2) Phase II
> >> >> > >> > > > > > > >> > In  AbstractJobClusterExecutor, the job graph
> is
> >> >> > >> generated in
> >> >> > >> > > > the
> >> >> > >> > > > > > > >> execute
> >> >> > >> > > > > > > >> > function. We can still
> >> >> > >> > > > > > > >> > use the deployMode in it. With deployMode =
> cluster, the
> >> >> > >> > > execute
> >> >> > >> > > > > > > >> function
> >> >> > >> > > > > > > >> > only starts the cluster.
> >> >> > >> > > > > > > >> >
> >> >> > >> > > > > > > >> > When {Yarn/Kuberneates}PerJobClusterEntrypoint
> starts,
> >> >> > >> It will
> >> >> > >> > > > > start
> >> >> > >> > > > > > > the
> >> >> > >> > > > > > > >> > dispatch first, then we can use
> >> >> > >> > > > > > > >> > a ClusterEnvironment similar to
> ContextEnvironment to
> >> >> > >> submit
> >> >> > >> > > the
> >> >> > >> > > > > job
> >> >> > >> > > > > > > >> with
> >> >> > >> > > > > > > >> > jobName the local
> >> >> > >> > > > > > > >> > dispatcher. For the details, we need more
> investigation.
> >> >> > >> Let's
> >> >> > >> > > > > wait
> >> >> > >> > > > > > > >> > for @Aljoscha
> >> >> > >> > > > > > > >> > Krettek <[hidden email]> @Till Rohrmann <
> >> >> > >> > > > > [hidden email]
> >> >> > >> > > > > > >'s
> >> >> > >> > > > > > > >> > feedback after the holiday season.
> >> >> > >> > > > > > > >> >
> >> >> > >> > > > > > > >> > Thank you in advance. Merry Chrismas and Happy
> New
> >> >> > >> Year!!!
> >> >> > >> > > > > > > >> >
> >> >> > >> > > > > > > >> >
> >> >> > >> > > > > > > >> > Best Regards
> >> >> > >> > > > > > > >> > Peter Huang
> >> >> > >> > > > > > > >> >
> >> >> > >> > > > > > > >> >
> >> >> > >> > > > > > > >> >
> >> >> > >> > > > > > > >> >
> >> >> > >> > > > > > > >> >
> >> >> > >> > > > > > > >> >
> >> >> > >> > > > > > > >> >
> >> >> > >> > > > > > > >> >
> >> >> > >> > > > > > > >> > On Wed, Dec 25, 2019 at 1:08 AM Yang Wang <
> >> >> > >> > > > [hidden email]>
> >> >> > >> > > > > > > >> wrote:
> >> >> > >> > > > > > > >> >
> >> >> > >> > > > > > > >> >> Hi Peter,
> >> >> > >> > > > > > > >> >>
> >> >> > >> > > > > > > >> >> I think we need to reconsider tison's
> suggestion
> >> >> > >> seriously.
> >> >> > >> > > > After
> >> >> > >> > > > > > > >> FLIP-73,
> >> >> > >> > > > > > > >> >> the deployJobCluster has
> >> >> > >> > > > > > > >> >> beenmoved into `JobClusterExecutor#execute`.
> It should
> >> >> > >> not be
> >> >> > >> > > > > > > perceived
> >> >> > >> > > > > > > >> >> for `CliFrontend`. That
> >> >> > >> > > > > > > >> >> means the user program will *ALWAYS* be
> executed on
> >> >> > >> client
> >> >> > >> > > > side.
> >> >> > >> > > > > > This
> >> >> > >> > > > > > > >> is
> >> >> > >> > > > > > > >> >> the by design behavior.
> >> >> > >> > > > > > > >> >> So, we could not just add `if(client mode) ..
> else
> >> >> > >> if(cluster
> >> >> > >> > > > > mode)
> >> >> > >> > > > > > > >> ...`
> >> >> > >> > > > > > > >> >> codes in `CliFrontend` to bypass
> >> >> > >> > > > > > > >> >> the executor. We need to find a clean way to
> decouple
> >> >> > >> > > executing
> >> >> > >> > > > > > user
> >> >> > >> > > > > > > >> >> program and deploying per-job
> >> >> > >> > > > > > > >> >> cluster. Based on this, we could support to
> execute user
> >> >> > >> > > > program
> >> >> > >> > > > > on
> >> >> > >> > > > > > > >> client
> >> >> > >> > > > > > > >> >> or master side.
> >> >> > >> > > > > > > >> >>
> >> >> > >> > > > > > > >> >> Maybe Aljoscha and Jeff could give some good
> >> >> > >> suggestions.
> >> >> > >> > > > > > > >> >>
> >> >> > >> > > > > > > >> >>
> >> >> > >> > > > > > > >> >>
> >> >> > >> > > > > > > >> >> Best,
> >> >> > >> > > > > > > >> >> Yang
> >> >> > >> > > > > > > >> >>
> >> >> > >> > > > > > > >> >> Peter Huang <[hidden email]>
> 于2019年12月25日周三
> >> >> > >> > > > > 上午4:03写道:
> >> >> > >> > > > > > > >> >>
> >> >> > >> > > > > > > >> >>> Hi Jingjing,
> >> >> > >> > > > > > > >> >>>
> >> >> > >> > > > > > > >> >>> The improvement proposed is a deployment
> option for
> >> >> > >> CLI. For
> >> >> > >> > > > SQL
> >> >> > >> > > > > > > based
> >> >> > >> > > > > > > >> >>> Flink application, It is more convenient to
> use the
> >> >> > >> existing
> >> >> > >> > > > > model
> >> >> > >> > > > > > > in
> >> >> > >> > > > > > > >> >>> SqlClient in which
> >> >> > >> > > > > > > >> >>> the job graph is generated within SqlClient.
> After
> >> >> > >> adding
> >> >> > >> > > the
> >> >> > >> > > > > > > delayed
> >> >> > >> > > > > > > >> job
> >> >> > >> > > > > > > >> >>> graph generation, I think there is no change
> is needed
> >> >> > >> for
> >> >> > >> > > > your
> >> >> > >> > > > > > > side.
> >> >> > >> > > > > > > >> >>>
> >> >> > >> > > > > > > >> >>>
> >> >> > >> > > > > > > >> >>> Best Regards
> >> >> > >> > > > > > > >> >>> Peter Huang
> >> >> > >> > > > > > > >> >>>
> >> >> > >> > > > > > > >> >>>
> >> >> > >> > > > > > > >> >>> On Wed, Dec 18, 2019 at 6:01 AM jingjing bai
> <
> >> >> > >> > > > > > > >> [hidden email]>
> >> >> > >> > > > > > > >> >>> wrote:
> >> >> > >> > > > > > > >> >>>
> >> >> > >> > > > > > > >> >>>> hi peter:
> >> >> > >> > > > > > > >> >>>>    we had extension SqlClent to support sql
> job
> >> >> > >> submit in
> >> >> > >> > > web
> >> >> > >> > > > > > base
> >> >> > >> > > > > > > on
> >> >> > >> > > > > > > >> >>>> flink 1.9.   we support submit to yarn on
> per job
> >> >> > >> mode too.
> >> >> > >> > > > > > > >> >>>>    in this case, the job graph generated
> on client
> >> >> > >> side
> >> >> > >> > > .  I
> >> >> > >> > > > > > think
> >> >> > >> > > > > > > >> >>> this
> >> >> > >> > > > > > > >> >>>> discuss Mainly to improve api programme.
> but in my
> >> >> > >> case ,
> >> >> > >> > > > > there
> >> >> > >> > > > > > is
> >> >> > >> > > > > > > >> no
> >> >> > >> > > > > > > >> >>>> jar to upload but only a sql string .
> >> >> > >> > > > > > > >> >>>>    do u had more suggestion to improve for
> sql mode
> >> >> > >> or it
> >> >> > >> > > is
> >> >> > >> > > > > > only a
> >> >> > >> > > > > > > >> >>>> switch for api programme?
> >> >> > >> > > > > > > >> >>>>
> >> >> > >> > > > > > > >> >>>>
> >> >> > >> > > > > > > >> >>>> best
> >> >> > >> > > > > > > >> >>>> bai jj
> >> >> > >> > > > > > > >> >>>>
> >> >> > >> > > > > > > >> >>>>
> >> >> > >> > > > > > > >> >>>> Yang Wang <[hidden email]>
> 于2019年12月18日周三
> >> >> > >> 下午7:21写道:
> >> >> > >> > > > > > > >> >>>>
> >> >> > >> > > > > > > >> >>>>> I just want to revive this discussion.
> >> >> > >> > > > > > > >> >>>>>
> >> >> > >> > > > > > > >> >>>>> Recently, i am thinking about how to
> natively run
> >> >> > >> flink
> >> >> > >> > > > > per-job
> >> >> > >> > > > > > > >> >>> cluster on
> >> >> > >> > > > > > > >> >>>>> Kubernetes.
> >> >> > >> > > > > > > >> >>>>> The per-job mode on Kubernetes is very
> different
> >> >> > >> from on
> >> >> > >> > > > Yarn.
> >> >> > >> > > > > > And
> >> >> > >> > > > > > > >> we
> >> >> > >> > > > > > > >> >>> will
> >> >> > >> > > > > > > >> >>>>> have
> >> >> > >> > > > > > > >> >>>>> the same deployment requirements to the
> client and
> >> >> > >> entry
> >> >> > >> > > > > point.
> >> >> > >> > > > > > > >> >>>>>
> >> >> > >> > > > > > > >> >>>>> 1. Flink client not always need a local
> jar to start
> >> >> > >> a
> >> >> > >> > > Flink
> >> >> > >> > > > > > > per-job
> >> >> > >> > > > > > > >> >>>>> cluster. We could
> >> >> > >> > > > > > > >> >>>>> support multiple schemas. For example,
> >> >> > >> > > > file:///path/of/my.jar
> >> >> > >> > > > > > > means
> >> >> > >> > > > > > > >> a
> >> >> > >> > > > > > > >> >>> jar
> >> >> > >> > > > > > > >> >>>>> located
> >> >> > >> > > > > > > >> >>>>> at client side,
> >> >> > >> hdfs://myhdfs/user/myname/flink/my.jar
> >> >> > >> > > > means a
> >> >> > >> > > > > > jar
> >> >> > >> > > > > > > >> >>> located
> >> >> > >> > > > > > > >> >>>>> at
> >> >> > >> > > > > > > >> >>>>> remote hdfs, local:///path/in/image/my.jar
> means a
> >> >> > >> jar
> >> >> > >> > > > located
> >> >> > >> > > > > > at
> >> >> > >> > > > > > > >> >>>>> jobmanager side.
> >> >> > >> > > > > > > >> >>>>>
> >> >> > >> > > > > > > >> >>>>> 2. Support running user program on master
> side. This
> >> >> > >> also
> >> >> > >> > > > > means
> >> >> > >> > > > > > > the
> >> >> > >> > > > > > > >> >>> entry
> >> >> > >> > > > > > > >> >>>>> point
> >> >> > >> > > > > > > >> >>>>> will generate the job graph on master
> side. We could
> >> >> > >> use
> >> >> > >> > > the
> >> >> > >> > > > > > > >> >>>>> ClasspathJobGraphRetriever
> >> >> > >> > > > > > > >> >>>>> or start a local Flink client to achieve
> this
> >> >> > >> purpose.
> >> >> > >> > > > > > > >> >>>>>
> >> >> > >> > > > > > > >> >>>>>
> >> >> > >> > > > > > > >> >>>>> cc tison, Aljoscha & Kostas Do you think
> this is the
> >> >> > >> right
> >> >> > >> > > > > > > >> direction we
> >> >> > >> > > > > > > >> >>>>> need to work?
> >> >> > >> > > > > > > >> >>>>>
> >> >> > >> > > > > > > >> >>>>> tison <[hidden email]>
> 于2019年12月12日周四
> >> >> > >> 下午4:48写道:
> >> >> > >> > > > > > > >> >>>>>
> >> >> > >> > > > > > > >> >>>>>> A quick idea is that we separate the
> deployment
> >> >> > >> from user
> >> >> > >> > > > > > program
> >> >> > >> > > > > > > >> >>> that
> >> >> > >> > > > > > > >> >>>>> it
> >> >> > >> > > > > > > >> >>>>>> has always been done
> >> >> > >> > > > > > > >> >>>>>> outside the program. On user program
> executed there
> >> >> > >> is
> >> >> > >> > > > > always a
> >> >> > >> > > > > > > >> >>>>>> ClusterClient that communicates with
> >> >> > >> > > > > > > >> >>>>>> an existing cluster, remote or local. It
> will be
> >> >> > >> another
> >> >> > >> > > > > thread
> >> >> > >> > > > > > > so
> >> >> > >> > > > > > > >> >>> just
> >> >> > >> > > > > > > >> >>>>> for
> >> >> > >> > > > > > > >> >>>>>> your information.
> >> >> > >> > > > > > > >> >>>>>>
> >> >> > >> > > > > > > >> >>>>>> Best,
> >> >> > >> > > > > > > >> >>>>>> tison.
> >> >> > >> > > > > > > >> >>>>>>
> >> >> > >> > > > > > > >> >>>>>>
> >> >> > >> > > > > > > >> >>>>>> tison <[hidden email]>
> 于2019年12月12日周四
> >> >> > >> 下午4:40写道:
> >> >> > >> > > > > > > >> >>>>>>
> >> >> > >> > > > > > > >> >>>>>>> Hi Peter,
> >> >> > >> > > > > > > >> >>>>>>>
> >> >> > >> > > > > > > >> >>>>>>> Another concern I realized recently is
> that with
> >> >> > >> current
> >> >> > >> > > > > > > Executors
> >> >> > >> > > > > > > >> >>>>>>> abstraction(FLIP-73)
> >> >> > >> > > > > > > >> >>>>>>> I'm afraid that user program is designed
> to ALWAYS
> >> >> > >> run
> >> >> > >> > > on
> >> >> > >> > > > > the
> >> >> > >> > > > > > > >> >>> client
> >> >> > >> > > > > > > >> >>>>>> side.
> >> >> > >> > > > > > > >> >>>>>>> Specifically,
> >> >> > >> > > > > > > >> >>>>>>> we deploy the job in executor when
> env.execute
> >> >> > >> called.
> >> >> > >> > > > This
> >> >> > >> > > > > > > >> >>>>> abstraction
> >> >> > >> > > > > > > >> >>>>>>> possibly prevents
> >> >> > >> > > > > > > >> >>>>>>> Flink runs user program on the cluster
> side.
> >> >> > >> > > > > > > >> >>>>>>>
> >> >> > >> > > > > > > >> >>>>>>> For your proposal, in this case we
> already
> >> >> > >> compiled the
> >> >> > >> > > > > > program
> >> >> > >> > > > > > > >> and
> >> >> > >> > > > > > > >> >>>>> run
> >> >> > >> > > > > > > >> >>>>>> on
> >> >> > >> > > > > > > >> >>>>>>> the client side,
> >> >> > >> > > > > > > >> >>>>>>> even we deploy a cluster and retrieve
> job graph
> >> >> > >> from
> >> >> > >> > > > program
> >> >> > >> > > > > > > >> >>>>> metadata, it
> >> >> > >> > > > > > > >> >>>>>>> doesn't make
> >> >> > >> > > > > > > >> >>>>>>> many sense.
> >> >> > >> > > > > > > >> >>>>>>>
> >> >> > >> > > > > > > >> >>>>>>> cc Aljoscha & Kostas what do you think
> about this
> >> >> > >> > > > > constraint?
> >> >> > >> > > > > > > >> >>>>>>>
> >> >> > >> > > > > > > >> >>>>>>> Best,
> >> >> > >> > > > > > > >> >>>>>>> tison.
> >> >> > >> > > > > > > >> >>>>>>>
> >> >> > >> > > > > > > >> >>>>>>>
> >> >> > >> > > > > > > >> >>>>>>> Peter Huang <[hidden email]>
> >> >> > >> 于2019年12月10日周二
> >> >> > >> > > > > > > >> 下午12:45写道:
> >> >> > >> > > > > > > >> >>>>>>>
> >> >> > >> > > > > > > >> >>>>>>>> Hi Tison,
> >> >> > >> > > > > > > >> >>>>>>>>
> >> >> > >> > > > > > > >> >>>>>>>> Yes, you are right. I think I made the
> wrong
> >> >> > >> argument
> >> >> > >> > > in
> >> >> > >> > > > > the
> >> >> > >> > > > > > > doc.
> >> >> > >> > > > > > > >> >>>>>>>> Basically, the packaging jar problem is
> only for
> >> >> > >> > > platform
> >> >> > >> > > > > > > users.
> >> >> > >> > > > > > > >> >>> In
> >> >> > >> > > > > > > >> >>>>> our
> >> >> > >> > > > > > > >> >>>>>>>> internal deploy service,
> >> >> > >> > > > > > > >> >>>>>>>> we further optimized the deployment
> latency by
> >> >> > >> letting
> >> >> > >> > > > > users
> >> >> > >> > > > > > to
> >> >> > >> > > > > > > >> >>>>>> packaging
> >> >> > >> > > > > > > >> >>>>>>>> flink-runtime together with the uber
> jar, so that
> >> >> > >> we
> >> >> > >> > > > don't
> >> >> > >> > > > > > need
> >> >> > >> > > > > > > >> to
> >> >> > >> > > > > > > >> >>>>>>>> consider
> >> >> > >> > > > > > > >> >>>>>>>> multiple flink version
> >> >> > >> > > > > > > >> >>>>>>>> support for now. In the session client
> mode, as
> >> >> > >> Flink
> >> >> > >> > > > libs
> >> >> > >> > > > > > will
> >> >> > >> > > > > > > >> be
> >> >> > >> > > > > > > >> >>>>>> shipped
> >> >> > >> > > > > > > >> >>>>>>>> anyway as local resources of yarn.
> Users actually
> >> >> > >> don't
> >> >> > >> > > > > need
> >> >> > >> > > > > > to
> >> >> > >> > > > > > > >> >>>>> package
> >> >> > >> > > > > > > >> >>>>>>>> those libs into job jar.
> >> >> > >> > > > > > > >> >>>>>>>>
> >> >> > >> > > > > > > >> >>>>>>>>
> >> >> > >> > > > > > > >> >>>>>>>>
> >> >> > >> > > > > > > >> >>>>>>>> Best Regards
> >> >> > >> > > > > > > >> >>>>>>>> Peter Huang
> >> >> > >> > > > > > > >> >>>>>>>>
> >> >> > >> > > > > > > >> >>>>>>>> On Mon, Dec 9, 2019 at 8:35 PM tison <
> >> >> > >> > > > [hidden email]
> >> >> > >> > > > > >
> >> >> > >> > > > > > > >> >>> wrote:
> >> >> > >> > > > > > > >> >>>>>>>>
> >> >> > >> > > > > > > >> >>>>>>>>>> 3. What do you mean about the
> package? Do users
> >> >> > >> need
> >> >> > >> > > to
> >> >> > >> > > > > > > >> >>> compile
> >> >> > >> > > > > > > >> >>>>>> their
> >> >> > >> > > > > > > >> >>>>>>>>> jars
> >> >> > >> > > > > > > >> >>>>>>>>> inlcuding flink-clients,
> flink-optimizer,
> >> >> > >> flink-table
> >> >> > >> > > > > codes?
> >> >> > >> > > > > > > >> >>>>>>>>>
> >> >> > >> > > > > > > >> >>>>>>>>> The answer should be no because they
> exist in
> >> >> > >> system
> >> >> > >> > > > > > > classpath.
> >> >> > >> > > > > > > >> >>>>>>>>>
> >> >> > >> > > > > > > >> >>>>>>>>> Best,
> >> >> > >> > > > > > > >> >>>>>>>>> tison.
> >> >> > >> > > > > > > >> >>>>>>>>>
> >> >> > >> > > > > > > >> >>>>>>>>>
> >> >> > >> > > > > > > >> >>>>>>>>> Yang Wang <[hidden email]>
> 于2019年12月10日周二
> >> >> > >> > > > > 下午12:18写道:
> >> >> > >> > > > > > > >> >>>>>>>>>
> >> >> > >> > > > > > > >> >>>>>>>>>> Hi Peter,
> >> >> > >> > > > > > > >> >>>>>>>>>>
> >> >> > >> > > > > > > >> >>>>>>>>>> Thanks a lot for starting this
> discussion. I
> >> >> > >> think
> >> >> > >> > > this
> >> >> > >> > > > > is
> >> >> > >> > > > > > a
> >> >> > >> > > > > > > >> >>> very
> >> >> > >> > > > > > > >> >>>>>>>> useful
> >> >> > >> > > > > > > >> >>>>>>>>>> feature.
> >> >> > >> > > > > > > >> >>>>>>>>>>
> >> >> > >> > > > > > > >> >>>>>>>>>> Not only for Yarn, i am focused on
> flink on
> >> >> > >> > > Kubernetes
> >> >> > >> > > > > > > >> >>>>> integration
> >> >> > >> > > > > > > >> >>>>>> and
> >> >> > >> > > > > > > >> >>>>>>>>> come
> >> >> > >> > > > > > > >> >>>>>>>>>> across the same
> >> >> > >> > > > > > > >> >>>>>>>>>> problem. I do not want the job graph
> generated
> >> >> > >> on
> >> >> > >> > > > client
> >> >> > >> > > > > > > side.
> >> >> > >> > > > > > > >> >>>>>>>> Instead,
> >> >> > >> > > > > > > >> >>>>>>>>> the
> >> >> > >> > > > > > > >> >>>>>>>>>> user jars are built in
> >> >> > >> > > > > > > >> >>>>>>>>>> a user-defined image. When the job
> manager
> >> >> > >> launched,
> >> >> > >> > > we
> >> >> > >> > > > > > just
> >> >> > >> > > > > > > >> >>>>> need to
> >> >> > >> > > > > > > >> >>>>>>>>>> generate the job graph
> >> >> > >> > > > > > > >> >>>>>>>>>> based on local user jars.
> >> >> > >> > > > > > > >> >>>>>>>>>>
> >> >> > >> > > > > > > >> >>>>>>>>>> I have some small suggestion about
> this.
> >> >> > >> > > > > > > >> >>>>>>>>>>
> >> >> > >> > > > > > > >> >>>>>>>>>> 1. `ProgramJobGraphRetriever` is very
> similar to
> >> >> > >> > > > > > > >> >>>>>>>>>> `ClasspathJobGraphRetriever`, the
> differences
> >> >> > >> > > > > > > >> >>>>>>>>>> are the former needs
> `ProgramMetadata` and the
> >> >> > >> latter
> >> >> > >> > > > > needs
> >> >> > >> > > > > > > >> >>> some
> >> >> > >> > > > > > > >> >>>>>>>>> arguments.
> >> >> > >> > > > > > > >> >>>>>>>>>> Is it possible to
> >> >> > >> > > > > > > >> >>>>>>>>>> have an unified `JobGraphRetriever`
> to support
> >> >> > >> both?
> >> >> > >> > > > > > > >> >>>>>>>>>> 2. Is it possible to not use a local
> user jar to
> >> >> > >> > > start
> >> >> > >> > > > a
> >> >> > >> > > > > > > >> >>> per-job
> >> >> > >> > > > > > > >> >>>>>>>> cluster?
> >> >> > >> > > > > > > >> >>>>>>>>>> In your case, the user jars has
> >> >> > >> > > > > > > >> >>>>>>>>>> existed on hdfs already and we do
> need to
> >> >> > >> download
> >> >> > >> > > the
> >> >> > >> > > > > jars
> >> >> > >> > > > > > > to
> >> >> > >> > > > > > > >> >>>>>>>> deployer
> >> >> > >> > > > > > > >> >>>>>>>>>> service. Currently, we
> >> >> > >> > > > > > > >> >>>>>>>>>> always need a local user jar to start
> a flink
> >> >> > >> > > cluster.
> >> >> > >> > > > It
> >> >> > >> > > > > > is
> >> >> > >> > > > > > > >> >>> be
> >> >> > >> > > > > > > >> >>>>>> great
> >> >> > >> > > > > > > >> >>>>>>>> if
> >> >> > >> > > > > > > >> >>>>>>>>> we
> >> >> > >> > > > > > > >> >>>>>>>>>> could support remote user jars.
> >> >> > >> > > > > > > >> >>>>>>>>>>>> In the implementation, we assume
> users package
> >> >> > >> > > > > > > >> >>> flink-clients,
> >> >> > >> > > > > > > >> >>>>>>>>>> flink-optimizer, flink-table together
> within
> >> >> > >> the job
> >> >> > >> > > > jar.
> >> >> > >> > > > > > > >> >>>>> Otherwise,
> >> >> > >> > > > > > > >> >>>>>>>> the
> >> >> > >> > > > > > > >> >>>>>>>>>> job graph generation within
> >> >> > >> JobClusterEntryPoint will
> >> >> > >> > > > > fail.
> >> >> > >> > > > > > > >> >>>>>>>>>> 3. What do you mean about the
> package? Do users
> >> >> > >> need
> >> >> > >> > > to
> >> >> > >> > > > > > > >> >>> compile
> >> >> > >> > > > > > > >> >>>>>> their
> >> >> > >> > > > > > > >> >>>>>>>>> jars
> >> >> > >> > > > > > > >> >>>>>>>>>> inlcuding flink-clients,
> flink-optimizer,
> >> >> > >> flink-table
> >> >> > >> > > > > > codes?
> >> >> > >> > > > > > > >> >>>>>>>>>>
> >> >> > >> > > > > > > >> >>>>>>>>>>
> >> >> > >> > > > > > > >> >>>>>>>>>>
> >> >> > >> > > > > > > >> >>>>>>>>>> Best,
> >> >> > >> > > > > > > >> >>>>>>>>>> Yang
> >> >> > >> > > > > > > >> >>>>>>>>>>
> >> >> > >> > > > > > > >> >>>>>>>>>> Peter Huang <
> [hidden email]>
> >> >> > >> > > > 于2019年12月10日周二
> >> >> > >> > > > > > > >> >>>>> 上午2:37写道:
> >> >> > >> > > > > > > >> >>>>>>>>>>
> >> >> > >> > > > > > > >> >>>>>>>>>>> Dear All,
> >> >> > >> > > > > > > >> >>>>>>>>>>>
> >> >> > >> > > > > > > >> >>>>>>>>>>> Recently, the Flink community starts
> to
> >> >> > >> improve the
> >> >> > >> > > > yarn
> >> >> > >> > > > > > > >> >>>>> cluster
> >> >> > >> > > > > > > >> >>>>>>>>>> descriptor
> >> >> > >> > > > > > > >> >>>>>>>>>>> to make job jar and config files
> configurable
> >> >> > >> from
> >> >> > >> > > > CLI.
> >> >> > >> > > > > It
> >> >> > >> > > > > > > >> >>>>>> improves
> >> >> > >> > > > > > > >> >>>>>>>> the
> >> >> > >> > > > > > > >> >>>>>>>>>>> flexibility of  Flink deployment
> Yarn Per Job
> >> >> > >> Mode.
> >> >> > >> > > > For
> >> >> > >> > > > > > > >> >>>>> platform
> >> >> > >> > > > > > > >> >>>>>>>> users
> >> >> > >> > > > > > > >> >>>>>>>>>> who
> >> >> > >> > > > > > > >> >>>>>>>>>>> manage tens of hundreds of streaming
> pipelines
> >> >> > >> for
> >> >> > >> > > the
> >> >> > >> > > > > > whole
> >> >> > >> > > > > > > >> >>>>> org
> >> >> > >> > > > > > > >> >>>>>> or
> >> >> > >> > > > > > > >> >>>>>>>>>>> company, we found the job graph
> generation in
> >> >> > >> > > > > client-side
> >> >> > >> > > > > > is
> >> >> > >> > > > > > > >> >>>>>> another
> >> >> > >> > > > > > > >> >>>>>>>>>>> pinpoint. Thus, we want to propose a
> >> >> > >> configurable
> >> >> > >> > > > > feature
> >> >> > >> > > > > > > >> >>> for
> >> >> > >> > > > > > > >> >>>>>>>>>>> FlinkYarnSessionCli. The feature can
> allow
> >> >> > >> users to
> >> >> > >> > > > > choose
> >> >> > >> > > > > > > >> >>> the
> >> >> > >> > > > > > > >> >>>>> job
> >> >> > >> > > > > > > >> >>>>>>>>> graph
> >> >> > >> > > > > > > >> >>>>>>>>>>> generation in Flink
> ClusterEntryPoint so that
> >> >> > >> the
> >> >> > >> > > job
> >> >> > >> > > > > jar
> >> >> > >> > > > > > > >> >>>>> doesn't
> >> >> > >> > > > > > > >> >>>>>>>> need
> >> >> > >> > > > > > > >> >>>>>>>>> to
> >> >> > >> > > > > > > >> >>>>>>>>>>> be locally for the job graph
> generation. The
> >> >> > >> > > proposal
> >> >> > >> > > > is
> >> >> > >> > > > > > > >> >>>>> organized
> >> >> > >> > > > > > > >> >>>>>>>> as a
> >> >> > >> > > > > > > >> >>>>>>>>>>> FLIP
> >> >> > >> > > > > > > >> >>>>>>>>>>>
> >> >> > >> > > > > > > >> >>>>>>>>>>>
> >> >> > >> > > > > > > >> >>>>>>>>>>
> >> >> > >> > > > > > > >> >>>>>>>>>
> >> >> > >> > > > > > > >> >>>>>>>>
> >> >> > >> > > > > > > >> >>>>>>
> >> >> > >> > > > > > > >> >>>>>
> >> >> > >> > > > > > > >> >>>
> >> >> > >> > > > > > > >>
> >> >> > >> > > > > > >
> >> >> > >> > > > > >
> >> >> > >> > > > >
> >> >> > >> > > >
> >> >> > >> > >
> >> >> > >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-85+Delayed+JobGraph+Generation
> >> >> > >> > > > > > > >> >>>>>>>>>>> .
> >> >> > >> > > > > > > >> >>>>>>>>>>>
> >> >> > >> > > > > > > >> >>>>>>>>>>> Any questions and suggestions are
> welcomed.
> >> >> > >> Thank
> >> >> > >> > > you
> >> >> > >> > > > in
> >> >> > >> > > > > > > >> >>>>> advance.
> >> >> > >> > > > > > > >> >>>>>>>>>>>
> >> >> > >> > > > > > > >> >>>>>>>>>>>
> >> >> > >> > > > > > > >> >>>>>>>>>>> Best Regards
> >> >> > >> > > > > > > >> >>>>>>>>>>> Peter Huang
> >> >> > >> > > > > > > >> >>>>>>>>>>>
> >> >> > >> > > > > > > >> >>>>>>>>>>
> >> >> > >> > > > > > > >> >>>>>>>>>
> >> >> > >> > > > > > > >> >>>>>>>>
> >> >> > >> > > > > > > >> >>>>>>>
> >> >> > >> > > > > > > >> >>>>>>
> >> >> > >> > > > > > > >> >>>>>
> >> >> > >> > > > > > > >> >>>>
> >> >> > >> > > > > > > >> >>>
> >> >> > >> > > > > > > >> >>
> >> >> > >> > > > > > > >>
> >> >> > >> > > > > > > >>
> >> >> > >> > > > > > >
> >> >> > >> > > > > >
> >> >> > >> > > > >
> >> >> > >> > > >
> >> >> > >> > >
> >> >> > >>
> >> >> > >
> >> >>
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-85: Delayed Job Graph Generation

Peter Huang
Hi Yang and Kostas,

Thanks for the clarification. It makes more sense to me if the long term
goal is to replace per job mode to application mode
 in the future (at the time that multiple execute can be supported). Before
that, It will be better to keep the concept of
 application mode internally. As Yang suggested, User only need to use a
`-R/-- remote-deploy` cli option to launch
a per job cluster with the main function executed in cluster
entry-point.  +1 for the execution plan.



Best Regards
Peter Huang




On Tue, Mar 3, 2020 at 7:11 AM Yang Wang <[hidden email]> wrote:

> Hi Peter,
>
> Having the application mode does not mean we will drop the cluster-deploy
> option. I just want to share some thoughts about “Application Mode”.
>
>
> 1. The application mode could cover the per-job sematic. Its lifecyle is
> bound
> to the user `main()`. And all the jobs in the user main will be executed
> in a same
> Flink cluster. In first phase of FLIP-85 implementation, running user main
> on the
> cluster side could be supported in application mode.
>
> 2. Maybe in the future, we also need to support multiple `execute()` on
> client side
> in a same Flink cluster. Then the per-job mode will evolve to application
> mode.
>
> 3. From user perspective, only a `-R/-- remote-deploy` cli option is
> visible. They
> are not aware of the application mode.
>
> 4. In the first phase, the application mode is working as “per-job”(only
> one job in
> the user main). We just leave more potential for the future.
>
>
> I am not against with calling it “cluster deploy mode” if you all think it
> is clearer for users.
>
>
>
> Best,
> Yang
>
> Kostas Kloudas <[hidden email]> 于2020年3月3日周二 下午6:49写道:
>
>> Hi Peter,
>>
>> I understand your point. This is why I was also a bit torn about the
>> name and my proposal was a bit aligned with yours (something along the
>> lines of "cluster deploy" mode).
>>
>> But many of the other participants in the discussion suggested the
>> "Application Mode". I think that the reasoning is that now the user's
>> Application is more self-contained.
>> It will be submitted to the cluster and the user can just disconnect.
>> In addition, as discussed briefly in the doc, in the future there may
>> be better support for multi-execute applications which will bring us
>> one step closer to the true "Application Mode". But this is how I
>> interpreted their arguments, of course they can also express their
>> thoughts on the topic :)
>>
>> Cheers,
>> Kostas
>>
>> On Mon, Mar 2, 2020 at 6:15 PM Peter Huang <[hidden email]>
>> wrote:
>> >
>> > Hi Kostas,
>> >
>> > Thanks for updating the wiki. We have aligned with the implementations
>> in the doc. But I feel it is still a little bit confusing of the naming
>> from a user's perspective. It is well known that Flink support per job
>> cluster and session cluster. The concept is in the layer of how a job is
>> managed within Flink. The method introduced util now is a kind of mixing
>> job and session cluster to promising the implementation complexity. We
>> probably don't need to label it as Application Model as the same layer of
>> per job cluster and session cluster. Conceptually, I think it is still a
>> cluster mode implementation for per job cluster.
>> >
>> > To minimize the confusion of users, I think it would be better just an
>> option of per job cluster for each type of cluster manager. How do you
>> think?
>> >
>> >
>> > Best Regards
>> > Peter Huang
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > On Mon, Mar 2, 2020 at 7:22 AM Kostas Kloudas <[hidden email]>
>> wrote:
>> >>
>> >> Hi Yang,
>> >>
>> >> The difference between per-job and application mode is that, as you
>> >> described, in the per-job mode the main is executed on the client
>> >> while in the application mode, the main is executed on the cluster.
>> >> I do not think we have to offer "application mode" with running the
>> >> main on the client side as this is exactly what the per-job mode does
>> >> currently and, as you described also, it would be redundant.
>> >>
>> >> Sorry if this was not clear in the document.
>> >>
>> >> Cheers,
>> >> Kostas
>> >>
>> >> On Mon, Mar 2, 2020 at 3:17 PM Yang Wang <[hidden email]>
>> wrote:
>> >> >
>> >> > Hi Kostas,
>> >> >
>> >> > Thanks a lot for your conclusion and updating the FLIP-85 WIKI.
>> Currently, i have no more
>> >> > questions about motivation, approach, fault tolerance and the first
>> phase implementation.
>> >> >
>> >> > I think the new title "Flink Application Mode" makes a lot senses to
>> me. Especially for the
>> >> > containerized environment, the cluster deploy option will be very
>> useful.
>> >> >
>> >> > Just one concern, how do we introduce this new application mode to
>> our users?
>> >> > Each user program(i.e. `main()`) is an application. Currently, we
>> intend to only support one
>> >> > `execute()`. So what's the difference between per-job and
>> application mode?
>> >> >
>> >> > For per-job, user `main()` is always executed on client side. And
>> For application mode, user
>> >> > `main()` could be executed on client or master side(configured via
>> cli option).
>> >> > Right? We need to have a clear concept. Otherwise, the users will be
>> more and more confusing.
>> >> >
>> >> >
>> >> > Best,
>> >> > Yang
>> >> >
>> >> > Kostas Kloudas <[hidden email]> 于2020年3月2日周一 下午5:58写道:
>> >> >>
>> >> >> Hi all,
>> >> >>
>> >> >> I update
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-85+Flink+Application+Mode
>> >> >> based on the discussion we had here:
>> >> >>
>> >> >>
>> https://docs.google.com/document/d/1ji72s3FD9DYUyGuKnJoO4ApzV-nSsZa0-bceGXW7Ocw/edit#
>> >> >>
>> >> >> Please let me know what you think and please keep the discussion in
>> the ML :)
>> >> >>
>> >> >> Thanks for starting the discussion and I hope that soon we will be
>> >> >> able to vote on the FLIP.
>> >> >>
>> >> >> Cheers,
>> >> >> Kostas
>> >> >>
>> >> >> On Thu, Jan 16, 2020 at 3:40 AM Yang Wang <[hidden email]>
>> wrote:
>> >> >> >
>> >> >> > Hi all,
>> >> >> >
>> >> >> > Thanks a lot for the feedback from @Kostas Kloudas. Your all
>> concerns are
>> >> >> > on point. The FLIP-85 is mainly
>> >> >> > focused on supporting cluster mode for per-job. Since it is more
>> urgent and
>> >> >> > have much more use
>> >> >> > cases both in Yarn and Kubernetes deployment. For session
>> cluster, we could
>> >> >> > have more discussion
>> >> >> > in a new thread later.
>> >> >> >
>> >> >> > #1, How to download the user jars and dependencies for per-job in
>> cluster
>> >> >> > mode?
>> >> >> > For Yarn, we could register the user jars and dependencies as
>> >> >> > LocalResource. They will be distributed
>> >> >> > by Yarn. And once the JobManager and TaskManager launched, the
>> jars are
>> >> >> > already exists.
>> >> >> > For Standalone per-job and K8s, we expect that the user jars
>> >> >> > and dependencies are built into the image.
>> >> >> > Or the InitContainer could be used for downloading. It is natively
>> >> >> > distributed and we will not have bottleneck.
>> >> >> >
>> >> >> > #2, Job graph recovery
>> >> >> > We could have an optimization to store job graph on the DFS.
>> However, i
>> >> >> > suggest building a new jobgraph
>> >> >> > from the configuration is the default option. Since we will not
>> always have
>> >> >> > a DFS store when deploying a
>> >> >> > Flink per-job cluster. Of course, we assume that using the same
>> >> >> > configuration(e.g. job_id, user_jar, main_class,
>> >> >> > main_args, parallelism, savepoint_settings, etc.) will get a same
>> job
>> >> >> > graph. I think the standalone per-job
>> >> >> > already has the similar behavior.
>> >> >> >
>> >> >> > #3, What happens with jobs that have multiple execute calls?
>> >> >> > Currently, it is really a problem. Even we use a local client on
>> Flink
>> >> >> > master side, it will have different behavior with
>> >> >> > client mode. For client mode, if we execute multiple times, then
>> we will
>> >> >> > deploy multiple Flink clusters for each execute.
>> >> >> > I am not pretty sure whether it is reasonable. However, i still
>> think using
>> >> >> > the local client is a good choice. We could
>> >> >> > continue the discussion in a new thread. @Zili Chen <
>> [hidden email]> Do
>> >> >> > you want to drive this?
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > Best,
>> >> >> > Yang
>> >> >> >
>> >> >> > Peter Huang <[hidden email]> 于2020年1月16日周四 上午1:55写道:
>> >> >> >
>> >> >> > > Hi Kostas,
>> >> >> > >
>> >> >> > > Thanks for this feedback. I can't agree more about the opinion.
>> The
>> >> >> > > cluster mode should be added
>> >> >> > > first in per job cluster.
>> >> >> > >
>> >> >> > > 1) For job cluster implementation
>> >> >> > > 1. Job graph recovery from configuration or store as static job
>> graph as
>> >> >> > > session cluster. I think the static one will be better for less
>> recovery
>> >> >> > > time.
>> >> >> > > Let me update the doc for details.
>> >> >> > >
>> >> >> > > 2. For job execute multiple times, I think @Zili Chen
>> >> >> > > <[hidden email]> has proposed the local client solution
>> that can
>> >> >> > > the run program actually in the cluster entry point. We can put
>> the
>> >> >> > > implementation in the second stage,
>> >> >> > > or even a new FLIP for further discussion.
>> >> >> > >
>> >> >> > > 2) For session cluster implementation
>> >> >> > > We can disable the cluster mode for the session cluster in the
>> first
>> >> >> > > stage. I agree the jar downloading will be a painful thing.
>> >> >> > > We can consider about PoC and performance evaluation first. If
>> the end to
>> >> >> > > end experience is good enough, then we can consider
>> >> >> > > proceeding with the solution.
>> >> >> > >
>> >> >> > > Looking forward to more opinions from @Yang Wang <
>> [hidden email]> @Zili
>> >> >> > > Chen <[hidden email]> @Dian Fu <[hidden email]>.
>> >> >> > >
>> >> >> > >
>> >> >> > > Best Regards
>> >> >> > > Peter Huang
>> >> >> > >
>> >> >> > > On Wed, Jan 15, 2020 at 7:50 AM Kostas Kloudas <
>> [hidden email]> wrote:
>> >> >> > >
>> >> >> > >> Hi all,
>> >> >> > >>
>> >> >> > >> I am writing here as the discussion on the Google Doc seems to
>> be a
>> >> >> > >> bit difficult to follow.
>> >> >> > >>
>> >> >> > >> I think that in order to be able to make progress, it would be
>> helpful
>> >> >> > >> to focus on per-job mode for now.
>> >> >> > >> The reason is that:
>> >> >> > >>  1) making the (unique) JobSubmitHandler responsible for
>> creating the
>> >> >> > >> jobgraphs,
>> >> >> > >>   which includes downloading dependencies, is not an optimal
>> solution
>> >> >> > >>  2) even if we put the responsibility on the JobMaster,
>> currently each
>> >> >> > >> job has its own
>> >> >> > >>   JobMaster but they all run on the same process, so we have
>> again a
>> >> >> > >> single entity.
>> >> >> > >>
>> >> >> > >> Of course after this is done, and if we feel comfortable with
>> the
>> >> >> > >> solution, then we can go to the session mode.
>> >> >> > >>
>> >> >> > >> A second comment has to do with fault-tolerance in the per-job,
>> >> >> > >> cluster-deploy mode.
>> >> >> > >> In the document, it is suggested that upon recovery, the
>> JobMaster of
>> >> >> > >> each job re-creates the JobGraph.
>> >> >> > >> I am just wondering if it is better to create and store the
>> jobGraph
>> >> >> > >> upon submission and only fetch it
>> >> >> > >> upon recovery so that we have a static jobGraph.
>> >> >> > >>
>> >> >> > >> Finally, I have a question which is what happens with jobs
>> that have
>> >> >> > >> multiple execute calls?
>> >> >> > >> The semantics seem to change compared to the current
>> behaviour, right?
>> >> >> > >>
>> >> >> > >> Cheers,
>> >> >> > >> Kostas
>> >> >> > >>
>> >> >> > >> On Wed, Jan 8, 2020 at 8:05 PM tison <[hidden email]>
>> wrote:
>> >> >> > >> >
>> >> >> > >> > not always, Yang Wang is also not yet a committer but he can
>> join the
>> >> >> > >> > channel. I cannot find the id by clicking “Add new member in
>> channel” so
>> >> >> > >> > come to you and ask for try out the link. Possibly I will
>> find other
>> >> >> > >> ways
>> >> >> > >> > but the original purpose is that the slack channel is a
>> public area we
>> >> >> > >> > discuss about developing...
>> >> >> > >> > Best,
>> >> >> > >> > tison.
>> >> >> > >> >
>> >> >> > >> >
>> >> >> > >> > Peter Huang <[hidden email]> 于2020年1月9日周四
>> 上午2:44写道:
>> >> >> > >> >
>> >> >> > >> > > Hi Tison,
>> >> >> > >> > >
>> >> >> > >> > > I am not the committer of Flink yet. I think I can't join
>> it also.
>> >> >> > >> > >
>> >> >> > >> > >
>> >> >> > >> > > Best Regards
>> >> >> > >> > > Peter Huang
>> >> >> > >> > >
>> >> >> > >> > > On Wed, Jan 8, 2020 at 9:39 AM tison <[hidden email]>
>> wrote:
>> >> >> > >> > >
>> >> >> > >> > > > Hi Peter,
>> >> >> > >> > > >
>> >> >> > >> > > > Could you try out this link?
>> >> >> > >> > > https://the-asf.slack.com/messages/CNA3ADZPH
>> >> >> > >> > > >
>> >> >> > >> > > > Best,
>> >> >> > >> > > > tison.
>> >> >> > >> > > >
>> >> >> > >> > > >
>> >> >> > >> > > > Peter Huang <[hidden email]> 于2020年1月9日周四
>> 上午1:22写道:
>> >> >> > >> > > >
>> >> >> > >> > > > > Hi Tison,
>> >> >> > >> > > > >
>> >> >> > >> > > > > I can't join the group with shared link. Would you
>> please add me
>> >> >> > >> into
>> >> >> > >> > > the
>> >> >> > >> > > > > group? My slack account is huangzhenqiu0825.
>> >> >> > >> > > > > Thank you in advance.
>> >> >> > >> > > > >
>> >> >> > >> > > > >
>> >> >> > >> > > > > Best Regards
>> >> >> > >> > > > > Peter Huang
>> >> >> > >> > > > >
>> >> >> > >> > > > > On Wed, Jan 8, 2020 at 12:02 AM tison <
>> [hidden email]>
>> >> >> > >> wrote:
>> >> >> > >> > > > >
>> >> >> > >> > > > > > Hi Peter,
>> >> >> > >> > > > > >
>> >> >> > >> > > > > > As described above, this effort should get attention
>> from people
>> >> >> > >> > > > > developing
>> >> >> > >> > > > > > FLIP-73 a.k.a. Executor abstractions. I recommend
>> you to join
>> >> >> > >> the
>> >> >> > >> > > > public
>> >> >> > >> > > > > > slack channel[1] for Flink Client API Enhancement
>> and you can
>> >> >> > >> try to
>> >> >> > >> > > > > share
>> >> >> > >> > > > > > you detailed thoughts there. It possibly gets more
>> concrete
>> >> >> > >> > > attentions.
>> >> >> > >> > > > > >
>> >> >> > >> > > > > > Best,
>> >> >> > >> > > > > > tison.
>> >> >> > >> > > > > >
>> >> >> > >> > > > > > [1]
>> >> >> > >> > > > > >
>> >> >> > >> > > > > >
>> >> >> > >> > > > >
>> >> >> > >> > > >
>> >> >> > >> > >
>> >> >> > >>
>> https://slack.com/share/IS21SJ75H/Rk8HhUly9FuEHb7oGwBZ33uL/enQtODg2MDYwNjE5MTg3LTA2MjIzNDc1M2ZjZDVlMjdlZjk1M2RkYmJhNjAwMTk2ZDZkODQ4NmY5YmI4OGRhNWJkYTViMTM1NzlmMzc4OWM
>> >> >> > >> > > > > >
>> >> >> > >> > > > > >
>> >> >> > >> > > > > > Peter Huang <[hidden email]>
>> 于2020年1月7日周二 上午5:09写道:
>> >> >> > >> > > > > >
>> >> >> > >> > > > > > > Dear All,
>> >> >> > >> > > > > > >
>> >> >> > >> > > > > > > Happy new year! According to existing feedback
>> from the
>> >> >> > >> community,
>> >> >> > >> > > we
>> >> >> > >> > > > > > > revised the doc with the consideration of session
>> cluster
>> >> >> > >> support,
>> >> >> > >> > > > and
>> >> >> > >> > > > > > > concrete interface changes needed and execution
>> plan. Please
>> >> >> > >> take
>> >> >> > >> > > one
>> >> >> > >> > > > > > more
>> >> >> > >> > > > > > > round of review at your most convenient time.
>> >> >> > >> > > > > > >
>> >> >> > >> > > > > > >
>> >> >> > >> > > > > > >
>> >> >> > >> > > > > >
>> >> >> > >> > > > >
>> >> >> > >> > > >
>> >> >> > >> > >
>> >> >> > >>
>> https://docs.google.com/document/d/1aAwVjdZByA-0CHbgv16Me-vjaaDMCfhX7TzVVTuifYM/edit#
>> >> >> > >> > > > > > >
>> >> >> > >> > > > > > >
>> >> >> > >> > > > > > > Best Regards
>> >> >> > >> > > > > > > Peter Huang
>> >> >> > >> > > > > > >
>> >> >> > >> > > > > > >
>> >> >> > >> > > > > > >
>> >> >> > >> > > > > > >
>> >> >> > >> > > > > > >
>> >> >> > >> > > > > > > On Thu, Jan 2, 2020 at 11:29 AM Peter Huang <
>> >> >> > >> > > > > [hidden email]>
>> >> >> > >> > > > > > > wrote:
>> >> >> > >> > > > > > >
>> >> >> > >> > > > > > > > Hi Dian,
>> >> >> > >> > > > > > > > Thanks for giving us valuable feedbacks.
>> >> >> > >> > > > > > > >
>> >> >> > >> > > > > > > > 1) It's better to have a whole design for this
>> feature
>> >> >> > >> > > > > > > > For the suggestion of enabling the cluster mode
>> also session
>> >> >> > >> > > > > cluster, I
>> >> >> > >> > > > > > > > think Flink already supported it.
>> WebSubmissionExtension
>> >> >> > >> already
>> >> >> > >> > > > > allows
>> >> >> > >> > > > > > > > users to start a job with the specified jar by
>> using web UI.
>> >> >> > >> > > > > > > > But we need to enable the feature from CLI for
>> both local
>> >> >> > >> jar,
>> >> >> > >> > > > remote
>> >> >> > >> > > > > > > jar.
>> >> >> > >> > > > > > > > I will align with Yang Wang first about the
>> details and
>> >> >> > >> update
>> >> >> > >> > > the
>> >> >> > >> > > > > > design
>> >> >> > >> > > > > > > > doc.
>> >> >> > >> > > > > > > >
>> >> >> > >> > > > > > > > 2) It's better to consider the convenience for
>> users, such
>> >> >> > >> as
>> >> >> > >> > > > > debugging
>> >> >> > >> > > > > > > >
>> >> >> > >> > > > > > > > I am wondering whether we can store the
>> exception in
>> >> >> > >> jobgragh
>> >> >> > >> > > > > > > > generation in application master. As no
>> streaming graph can
>> >> >> > >> be
>> >> >> > >> > > > > > scheduled
>> >> >> > >> > > > > > > in
>> >> >> > >> > > > > > > > this case, there will be no more TM will be
>> requested from
>> >> >> > >> > > FlinkRM.
>> >> >> > >> > > > > > > > If the AM is still running, users can still
>> query it from
>> >> >> > >> CLI. As
>> >> >> > >> > > > it
>> >> >> > >> > > > > > > > requires more change, we can get some feedback
>> from <
>> >> >> > >> > > > > > [hidden email]
>> >> >> > >> > > > > > > >
>> >> >> > >> > > > > > > > and @[hidden email] <[hidden email]>.
>> >> >> > >> > > > > > > >
>> >> >> > >> > > > > > > > 3) It's better to consider the impact to the
>> stability of
>> >> >> > >> the
>> >> >> > >> > > > cluster
>> >> >> > >> > > > > > > >
>> >> >> > >> > > > > > > > I agree with Yang Wang's opinion.
>> >> >> > >> > > > > > > >
>> >> >> > >> > > > > > > >
>> >> >> > >> > > > > > > >
>> >> >> > >> > > > > > > > Best Regards
>> >> >> > >> > > > > > > > Peter Huang
>> >> >> > >> > > > > > > >
>> >> >> > >> > > > > > > >
>> >> >> > >> > > > > > > > On Sun, Dec 29, 2019 at 9:44 PM Dian Fu <
>> >> >> > >> [hidden email]>
>> >> >> > >> > > > > wrote:
>> >> >> > >> > > > > > > >
>> >> >> > >> > > > > > > >> Hi all,
>> >> >> > >> > > > > > > >>
>> >> >> > >> > > > > > > >> Sorry to jump into this discussion. Thanks
>> everyone for the
>> >> >> > >> > > > > > discussion.
>> >> >> > >> > > > > > > >> I'm very interested in this topic although I'm
>> not an
>> >> >> > >> expert in
>> >> >> > >> > > > this
>> >> >> > >> > > > > > > part.
>> >> >> > >> > > > > > > >> So I'm glad to share my thoughts as following:
>> >> >> > >> > > > > > > >>
>> >> >> > >> > > > > > > >> 1) It's better to have a whole design for this
>> feature
>> >> >> > >> > > > > > > >> As we know, there are two deployment modes:
>> per-job mode
>> >> >> > >> and
>> >> >> > >> > > > session
>> >> >> > >> > > > > > > >> mode. I'm wondering which mode really needs
>> this feature.
>> >> >> > >> As the
>> >> >> > >> > > > > > design
>> >> >> > >> > > > > > > doc
>> >> >> > >> > > > > > > >> mentioned, per-job mode is more used for
>> streaming jobs and
>> >> >> > >> > > > session
>> >> >> > >> > > > > > > mode is
>> >> >> > >> > > > > > > >> usually used for batch jobs(Of course, the job
>> types and
>> >> >> > >> the
>> >> >> > >> > > > > > deployment
>> >> >> > >> > > > > > > >> modes are orthogonal). Usually streaming job is
>> only
>> >> >> > >> needed to
>> >> >> > >> > > be
>> >> >> > >> > > > > > > submitted
>> >> >> > >> > > > > > > >> once and it will run for days or weeks, while
>> batch jobs
>> >> >> > >> will be
>> >> >> > >> > > > > > > submitted
>> >> >> > >> > > > > > > >> more frequently compared with streaming jobs.
>> This means
>> >> >> > >> that
>> >> >> > >> > > > maybe
>> >> >> > >> > > > > > > session
>> >> >> > >> > > > > > > >> mode also needs this feature. However, if we
>> support this
>> >> >> > >> > > feature
>> >> >> > >> > > > in
>> >> >> > >> > > > > > > >> session mode, the application master will
>> become the new
>> >> >> > >> > > > centralized
>> >> >> > >> > > > > > > >> service(which should be solved). So in this
>> case, it's
>> >> >> > >> better to
>> >> >> > >> > > > > have
>> >> >> > >> > > > > > a
>> >> >> > >> > > > > > > >> complete design for both per-job mode and
>> session mode.
>> >> >> > >> > > > Furthermore,
>> >> >> > >> > > > > > > even
>> >> >> > >> > > > > > > >> if we can do it phase by phase, we need to have
>> a whole
>> >> >> > >> picture
>> >> >> > >> > > of
>> >> >> > >> > > > > how
>> >> >> > >> > > > > > > it
>> >> >> > >> > > > > > > >> works in both per-job mode and session mode.
>> >> >> > >> > > > > > > >>
>> >> >> > >> > > > > > > >> 2) It's better to consider the convenience for
>> users, such
>> >> >> > >> as
>> >> >> > >> > > > > > debugging
>> >> >> > >> > > > > > > >> After we finish this feature, the job graph
>> will be
>> >> >> > >> compiled in
>> >> >> > >> > > > the
>> >> >> > >> > > > > > > >> application master, which means that users
>> cannot easily
>> >> >> > >> get the
>> >> >> > >> > > > > > > exception
>> >> >> > >> > > > > > > >> message synchorousely in the job client if
>> there are
>> >> >> > >> problems
>> >> >> > >> > > > during
>> >> >> > >> > > > > > the
>> >> >> > >> > > > > > > >> job graph compiling (especially for platform
>> users), such
>> >> >> > >> as the
>> >> >> > >> > > > > > > resource
>> >> >> > >> > > > > > > >> path is incorrect, the user program itself has
>> some
>> >> >> > >> problems,
>> >> >> > >> > > etc.
>> >> >> > >> > > > > > What
>> >> >> > >> > > > > > > I'm
>> >> >> > >> > > > > > > >> thinking is that maybe we should throw the
>> exceptions as
>> >> >> > >> early
>> >> >> > >> > > as
>> >> >> > >> > > > > > > possible
>> >> >> > >> > > > > > > >> (during job submission stage).
>> >> >> > >> > > > > > > >>
>> >> >> > >> > > > > > > >> 3) It's better to consider the impact to the
>> stability of
>> >> >> > >> the
>> >> >> > >> > > > > cluster
>> >> >> > >> > > > > > > >> If we perform the compiling in the application
>> master, we
>> >> >> > >> should
>> >> >> > >> > > > > > > consider
>> >> >> > >> > > > > > > >> the impact of the compiling errors. Although
>> YARN could
>> >> >> > >> resume
>> >> >> > >> > > the
>> >> >> > >> > > > > > > >> application master in case of failures, but in
>> some case
>> >> >> > >> the
>> >> >> > >> > > > > compiling
>> >> >> > >> > > > > > > >> failure may be a waste of cluster resource and
>> may impact
>> >> >> > >> the
>> >> >> > >> > > > > > stability
>> >> >> > >> > > > > > > the
>> >> >> > >> > > > > > > >> cluster and the other jobs in the cluster, such
>> as the
>> >> >> > >> resource
>> >> >> > >> > > > path
>> >> >> > >> > > > > > is
>> >> >> > >> > > > > > > >> incorrect, the user program itself has some
>> problems(in
>> >> >> > >> this
>> >> >> > >> > > case,
>> >> >> > >> > > > > job
>> >> >> > >> > > > > > > >> failover cannot solve this kind of problems)
>> etc. In the
>> >> >> > >> current
>> >> >> > >> > > > > > > >> implemention, the compiling errors are handled
>> in the
>> >> >> > >> client
>> >> >> > >> > > side
>> >> >> > >> > > > > and
>> >> >> > >> > > > > > > there
>> >> >> > >> > > > > > > >> is no impact to the cluster at all.
>> >> >> > >> > > > > > > >>
>> >> >> > >> > > > > > > >> Regarding to 1), it's clearly pointed in the
>> design doc
>> >> >> > >> that
>> >> >> > >> > > only
>> >> >> > >> > > > > > > per-job
>> >> >> > >> > > > > > > >> mode will be supported. However, I think it's
>> better to
>> >> >> > >> also
>> >> >> > >> > > > > consider
>> >> >> > >> > > > > > > the
>> >> >> > >> > > > > > > >> session mode in the design doc.
>> >> >> > >> > > > > > > >> Regarding to 2) and 3), I have not seen related
>> sections
>> >> >> > >> in the
>> >> >> > >> > > > > design
>> >> >> > >> > > > > > > >> doc. It will be good if we can cover them in
>> the design
>> >> >> > >> doc.
>> >> >> > >> > > > > > > >>
>> >> >> > >> > > > > > > >> Feel free to correct me If there is anything I
>> >> >> > >> misunderstand.
>> >> >> > >> > > > > > > >>
>> >> >> > >> > > > > > > >> Regards,
>> >> >> > >> > > > > > > >> Dian
>> >> >> > >> > > > > > > >>
>> >> >> > >> > > > > > > >>
>> >> >> > >> > > > > > > >> > 在 2019年12月27日,上午3:13,Peter Huang <
>> >> >> > >> [hidden email]>
>> >> >> > >> > > > 写道:
>> >> >> > >> > > > > > > >> >
>> >> >> > >> > > > > > > >> > Hi Yang,
>> >> >> > >> > > > > > > >> >
>> >> >> > >> > > > > > > >> > I can't agree more. The effort definitely
>> needs to align
>> >> >> > >> with
>> >> >> > >> > > > the
>> >> >> > >> > > > > > > final
>> >> >> > >> > > > > > > >> > goal of FLIP-73.
>> >> >> > >> > > > > > > >> > I am thinking about whether we can achieve
>> the goal with
>> >> >> > >> two
>> >> >> > >> > > > > phases.
>> >> >> > >> > > > > > > >> >
>> >> >> > >> > > > > > > >> > 1) Phase I
>> >> >> > >> > > > > > > >> > As the CLiFrontend will not be depreciated
>> soon. We can
>> >> >> > >> still
>> >> >> > >> > > > use
>> >> >> > >> > > > > > the
>> >> >> > >> > > > > > > >> > deployMode flag there,
>> >> >> > >> > > > > > > >> > pass the program info through Flink
>> configuration,  use
>> >> >> > >> the
>> >> >> > >> > > > > > > >> > ClassPathJobGraphRetriever
>> >> >> > >> > > > > > > >> > to generate the job graph in
>> ClusterEntrypoints of yarn
>> >> >> > >> and
>> >> >> > >> > > > > > > Kubernetes.
>> >> >> > >> > > > > > > >> >
>> >> >> > >> > > > > > > >> > 2) Phase II
>> >> >> > >> > > > > > > >> > In  AbstractJobClusterExecutor, the job graph
>> is
>> >> >> > >> generated in
>> >> >> > >> > > > the
>> >> >> > >> > > > > > > >> execute
>> >> >> > >> > > > > > > >> > function. We can still
>> >> >> > >> > > > > > > >> > use the deployMode in it. With deployMode =
>> cluster, the
>> >> >> > >> > > execute
>> >> >> > >> > > > > > > >> function
>> >> >> > >> > > > > > > >> > only starts the cluster.
>> >> >> > >> > > > > > > >> >
>> >> >> > >> > > > > > > >> > When
>> {Yarn/Kuberneates}PerJobClusterEntrypoint starts,
>> >> >> > >> It will
>> >> >> > >> > > > > start
>> >> >> > >> > > > > > > the
>> >> >> > >> > > > > > > >> > dispatch first, then we can use
>> >> >> > >> > > > > > > >> > a ClusterEnvironment similar to
>> ContextEnvironment to
>> >> >> > >> submit
>> >> >> > >> > > the
>> >> >> > >> > > > > job
>> >> >> > >> > > > > > > >> with
>> >> >> > >> > > > > > > >> > jobName the local
>> >> >> > >> > > > > > > >> > dispatcher. For the details, we need more
>> investigation.
>> >> >> > >> Let's
>> >> >> > >> > > > > wait
>> >> >> > >> > > > > > > >> > for @Aljoscha
>> >> >> > >> > > > > > > >> > Krettek <[hidden email]> @Till Rohrmann
>> <
>> >> >> > >> > > > > [hidden email]
>> >> >> > >> > > > > > >'s
>> >> >> > >> > > > > > > >> > feedback after the holiday season.
>> >> >> > >> > > > > > > >> >
>> >> >> > >> > > > > > > >> > Thank you in advance. Merry Chrismas and
>> Happy New
>> >> >> > >> Year!!!
>> >> >> > >> > > > > > > >> >
>> >> >> > >> > > > > > > >> >
>> >> >> > >> > > > > > > >> > Best Regards
>> >> >> > >> > > > > > > >> > Peter Huang
>> >> >> > >> > > > > > > >> >
>> >> >> > >> > > > > > > >> >
>> >> >> > >> > > > > > > >> >
>> >> >> > >> > > > > > > >> >
>> >> >> > >> > > > > > > >> >
>> >> >> > >> > > > > > > >> >
>> >> >> > >> > > > > > > >> >
>> >> >> > >> > > > > > > >> >
>> >> >> > >> > > > > > > >> > On Wed, Dec 25, 2019 at 1:08 AM Yang Wang <
>> >> >> > >> > > > [hidden email]>
>> >> >> > >> > > > > > > >> wrote:
>> >> >> > >> > > > > > > >> >
>> >> >> > >> > > > > > > >> >> Hi Peter,
>> >> >> > >> > > > > > > >> >>
>> >> >> > >> > > > > > > >> >> I think we need to reconsider tison's
>> suggestion
>> >> >> > >> seriously.
>> >> >> > >> > > > After
>> >> >> > >> > > > > > > >> FLIP-73,
>> >> >> > >> > > > > > > >> >> the deployJobCluster has
>> >> >> > >> > > > > > > >> >> beenmoved into `JobClusterExecutor#execute`.
>> It should
>> >> >> > >> not be
>> >> >> > >> > > > > > > perceived
>> >> >> > >> > > > > > > >> >> for `CliFrontend`. That
>> >> >> > >> > > > > > > >> >> means the user program will *ALWAYS* be
>> executed on
>> >> >> > >> client
>> >> >> > >> > > > side.
>> >> >> > >> > > > > > This
>> >> >> > >> > > > > > > >> is
>> >> >> > >> > > > > > > >> >> the by design behavior.
>> >> >> > >> > > > > > > >> >> So, we could not just add `if(client mode)
>> .. else
>> >> >> > >> if(cluster
>> >> >> > >> > > > > mode)
>> >> >> > >> > > > > > > >> ...`
>> >> >> > >> > > > > > > >> >> codes in `CliFrontend` to bypass
>> >> >> > >> > > > > > > >> >> the executor. We need to find a clean way to
>> decouple
>> >> >> > >> > > executing
>> >> >> > >> > > > > > user
>> >> >> > >> > > > > > > >> >> program and deploying per-job
>> >> >> > >> > > > > > > >> >> cluster. Based on this, we could support to
>> execute user
>> >> >> > >> > > > program
>> >> >> > >> > > > > on
>> >> >> > >> > > > > > > >> client
>> >> >> > >> > > > > > > >> >> or master side.
>> >> >> > >> > > > > > > >> >>
>> >> >> > >> > > > > > > >> >> Maybe Aljoscha and Jeff could give some good
>> >> >> > >> suggestions.
>> >> >> > >> > > > > > > >> >>
>> >> >> > >> > > > > > > >> >>
>> >> >> > >> > > > > > > >> >>
>> >> >> > >> > > > > > > >> >> Best,
>> >> >> > >> > > > > > > >> >> Yang
>> >> >> > >> > > > > > > >> >>
>> >> >> > >> > > > > > > >> >> Peter Huang <[hidden email]>
>> 于2019年12月25日周三
>> >> >> > >> > > > > 上午4:03写道:
>> >> >> > >> > > > > > > >> >>
>> >> >> > >> > > > > > > >> >>> Hi Jingjing,
>> >> >> > >> > > > > > > >> >>>
>> >> >> > >> > > > > > > >> >>> The improvement proposed is a deployment
>> option for
>> >> >> > >> CLI. For
>> >> >> > >> > > > SQL
>> >> >> > >> > > > > > > based
>> >> >> > >> > > > > > > >> >>> Flink application, It is more convenient to
>> use the
>> >> >> > >> existing
>> >> >> > >> > > > > model
>> >> >> > >> > > > > > > in
>> >> >> > >> > > > > > > >> >>> SqlClient in which
>> >> >> > >> > > > > > > >> >>> the job graph is generated within
>> SqlClient. After
>> >> >> > >> adding
>> >> >> > >> > > the
>> >> >> > >> > > > > > > delayed
>> >> >> > >> > > > > > > >> job
>> >> >> > >> > > > > > > >> >>> graph generation, I think there is no
>> change is needed
>> >> >> > >> for
>> >> >> > >> > > > your
>> >> >> > >> > > > > > > side.
>> >> >> > >> > > > > > > >> >>>
>> >> >> > >> > > > > > > >> >>>
>> >> >> > >> > > > > > > >> >>> Best Regards
>> >> >> > >> > > > > > > >> >>> Peter Huang
>> >> >> > >> > > > > > > >> >>>
>> >> >> > >> > > > > > > >> >>>
>> >> >> > >> > > > > > > >> >>> On Wed, Dec 18, 2019 at 6:01 AM jingjing
>> bai <
>> >> >> > >> > > > > > > >> [hidden email]>
>> >> >> > >> > > > > > > >> >>> wrote:
>> >> >> > >> > > > > > > >> >>>
>> >> >> > >> > > > > > > >> >>>> hi peter:
>> >> >> > >> > > > > > > >> >>>>    we had extension SqlClent to support
>> sql job
>> >> >> > >> submit in
>> >> >> > >> > > web
>> >> >> > >> > > > > > base
>> >> >> > >> > > > > > > on
>> >> >> > >> > > > > > > >> >>>> flink 1.9.   we support submit to yarn on
>> per job
>> >> >> > >> mode too.
>> >> >> > >> > > > > > > >> >>>>    in this case, the job graph generated
>> on client
>> >> >> > >> side
>> >> >> > >> > > .  I
>> >> >> > >> > > > > > think
>> >> >> > >> > > > > > > >> >>> this
>> >> >> > >> > > > > > > >> >>>> discuss Mainly to improve api programme.
>> but in my
>> >> >> > >> case ,
>> >> >> > >> > > > > there
>> >> >> > >> > > > > > is
>> >> >> > >> > > > > > > >> no
>> >> >> > >> > > > > > > >> >>>> jar to upload but only a sql string .
>> >> >> > >> > > > > > > >> >>>>    do u had more suggestion to improve for
>> sql mode
>> >> >> > >> or it
>> >> >> > >> > > is
>> >> >> > >> > > > > > only a
>> >> >> > >> > > > > > > >> >>>> switch for api programme?
>> >> >> > >> > > > > > > >> >>>>
>> >> >> > >> > > > > > > >> >>>>
>> >> >> > >> > > > > > > >> >>>> best
>> >> >> > >> > > > > > > >> >>>> bai jj
>> >> >> > >> > > > > > > >> >>>>
>> >> >> > >> > > > > > > >> >>>>
>> >> >> > >> > > > > > > >> >>>> Yang Wang <[hidden email]>
>> 于2019年12月18日周三
>> >> >> > >> 下午7:21写道:
>> >> >> > >> > > > > > > >> >>>>
>> >> >> > >> > > > > > > >> >>>>> I just want to revive this discussion.
>> >> >> > >> > > > > > > >> >>>>>
>> >> >> > >> > > > > > > >> >>>>> Recently, i am thinking about how to
>> natively run
>> >> >> > >> flink
>> >> >> > >> > > > > per-job
>> >> >> > >> > > > > > > >> >>> cluster on
>> >> >> > >> > > > > > > >> >>>>> Kubernetes.
>> >> >> > >> > > > > > > >> >>>>> The per-job mode on Kubernetes is very
>> different
>> >> >> > >> from on
>> >> >> > >> > > > Yarn.
>> >> >> > >> > > > > > And
>> >> >> > >> > > > > > > >> we
>> >> >> > >> > > > > > > >> >>> will
>> >> >> > >> > > > > > > >> >>>>> have
>> >> >> > >> > > > > > > >> >>>>> the same deployment requirements to the
>> client and
>> >> >> > >> entry
>> >> >> > >> > > > > point.
>> >> >> > >> > > > > > > >> >>>>>
>> >> >> > >> > > > > > > >> >>>>> 1. Flink client not always need a local
>> jar to start
>> >> >> > >> a
>> >> >> > >> > > Flink
>> >> >> > >> > > > > > > per-job
>> >> >> > >> > > > > > > >> >>>>> cluster. We could
>> >> >> > >> > > > > > > >> >>>>> support multiple schemas. For example,
>> >> >> > >> > > > file:///path/of/my.jar
>> >> >> > >> > > > > > > means
>> >> >> > >> > > > > > > >> a
>> >> >> > >> > > > > > > >> >>> jar
>> >> >> > >> > > > > > > >> >>>>> located
>> >> >> > >> > > > > > > >> >>>>> at client side,
>> >> >> > >> hdfs://myhdfs/user/myname/flink/my.jar
>> >> >> > >> > > > means a
>> >> >> > >> > > > > > jar
>> >> >> > >> > > > > > > >> >>> located
>> >> >> > >> > > > > > > >> >>>>> at
>> >> >> > >> > > > > > > >> >>>>> remote hdfs,
>> local:///path/in/image/my.jar means a
>> >> >> > >> jar
>> >> >> > >> > > > located
>> >> >> > >> > > > > > at
>> >> >> > >> > > > > > > >> >>>>> jobmanager side.
>> >> >> > >> > > > > > > >> >>>>>
>> >> >> > >> > > > > > > >> >>>>> 2. Support running user program on master
>> side. This
>> >> >> > >> also
>> >> >> > >> > > > > means
>> >> >> > >> > > > > > > the
>> >> >> > >> > > > > > > >> >>> entry
>> >> >> > >> > > > > > > >> >>>>> point
>> >> >> > >> > > > > > > >> >>>>> will generate the job graph on master
>> side. We could
>> >> >> > >> use
>> >> >> > >> > > the
>> >> >> > >> > > > > > > >> >>>>> ClasspathJobGraphRetriever
>> >> >> > >> > > > > > > >> >>>>> or start a local Flink client to achieve
>> this
>> >> >> > >> purpose.
>> >> >> > >> > > > > > > >> >>>>>
>> >> >> > >> > > > > > > >> >>>>>
>> >> >> > >> > > > > > > >> >>>>> cc tison, Aljoscha & Kostas Do you think
>> this is the
>> >> >> > >> right
>> >> >> > >> > > > > > > >> direction we
>> >> >> > >> > > > > > > >> >>>>> need to work?
>> >> >> > >> > > > > > > >> >>>>>
>> >> >> > >> > > > > > > >> >>>>> tison <[hidden email]>
>> 于2019年12月12日周四
>> >> >> > >> 下午4:48写道:
>> >> >> > >> > > > > > > >> >>>>>
>> >> >> > >> > > > > > > >> >>>>>> A quick idea is that we separate the
>> deployment
>> >> >> > >> from user
>> >> >> > >> > > > > > program
>> >> >> > >> > > > > > > >> >>> that
>> >> >> > >> > > > > > > >> >>>>> it
>> >> >> > >> > > > > > > >> >>>>>> has always been done
>> >> >> > >> > > > > > > >> >>>>>> outside the program. On user program
>> executed there
>> >> >> > >> is
>> >> >> > >> > > > > always a
>> >> >> > >> > > > > > > >> >>>>>> ClusterClient that communicates with
>> >> >> > >> > > > > > > >> >>>>>> an existing cluster, remote or local. It
>> will be
>> >> >> > >> another
>> >> >> > >> > > > > thread
>> >> >> > >> > > > > > > so
>> >> >> > >> > > > > > > >> >>> just
>> >> >> > >> > > > > > > >> >>>>> for
>> >> >> > >> > > > > > > >> >>>>>> your information.
>> >> >> > >> > > > > > > >> >>>>>>
>> >> >> > >> > > > > > > >> >>>>>> Best,
>> >> >> > >> > > > > > > >> >>>>>> tison.
>> >> >> > >> > > > > > > >> >>>>>>
>> >> >> > >> > > > > > > >> >>>>>>
>> >> >> > >> > > > > > > >> >>>>>> tison <[hidden email]>
>> 于2019年12月12日周四
>> >> >> > >> 下午4:40写道:
>> >> >> > >> > > > > > > >> >>>>>>
>> >> >> > >> > > > > > > >> >>>>>>> Hi Peter,
>> >> >> > >> > > > > > > >> >>>>>>>
>> >> >> > >> > > > > > > >> >>>>>>> Another concern I realized recently is
>> that with
>> >> >> > >> current
>> >> >> > >> > > > > > > Executors
>> >> >> > >> > > > > > > >> >>>>>>> abstraction(FLIP-73)
>> >> >> > >> > > > > > > >> >>>>>>> I'm afraid that user program is
>> designed to ALWAYS
>> >> >> > >> run
>> >> >> > >> > > on
>> >> >> > >> > > > > the
>> >> >> > >> > > > > > > >> >>> client
>> >> >> > >> > > > > > > >> >>>>>> side.
>> >> >> > >> > > > > > > >> >>>>>>> Specifically,
>> >> >> > >> > > > > > > >> >>>>>>> we deploy the job in executor when
>> env.execute
>> >> >> > >> called.
>> >> >> > >> > > > This
>> >> >> > >> > > > > > > >> >>>>> abstraction
>> >> >> > >> > > > > > > >> >>>>>>> possibly prevents
>> >> >> > >> > > > > > > >> >>>>>>> Flink runs user program on the cluster
>> side.
>> >> >> > >> > > > > > > >> >>>>>>>
>> >> >> > >> > > > > > > >> >>>>>>> For your proposal, in this case we
>> already
>> >> >> > >> compiled the
>> >> >> > >> > > > > > program
>> >> >> > >> > > > > > > >> and
>> >> >> > >> > > > > > > >> >>>>> run
>> >> >> > >> > > > > > > >> >>>>>> on
>> >> >> > >> > > > > > > >> >>>>>>> the client side,
>> >> >> > >> > > > > > > >> >>>>>>> even we deploy a cluster and retrieve
>> job graph
>> >> >> > >> from
>> >> >> > >> > > > program
>> >> >> > >> > > > > > > >> >>>>> metadata, it
>> >> >> > >> > > > > > > >> >>>>>>> doesn't make
>> >> >> > >> > > > > > > >> >>>>>>> many sense.
>> >> >> > >> > > > > > > >> >>>>>>>
>> >> >> > >> > > > > > > >> >>>>>>> cc Aljoscha & Kostas what do you think
>> about this
>> >> >> > >> > > > > constraint?
>> >> >> > >> > > > > > > >> >>>>>>>
>> >> >> > >> > > > > > > >> >>>>>>> Best,
>> >> >> > >> > > > > > > >> >>>>>>> tison.
>> >> >> > >> > > > > > > >> >>>>>>>
>> >> >> > >> > > > > > > >> >>>>>>>
>> >> >> > >> > > > > > > >> >>>>>>> Peter Huang <[hidden email]
>> >
>> >> >> > >> 于2019年12月10日周二
>> >> >> > >> > > > > > > >> 下午12:45写道:
>> >> >> > >> > > > > > > >> >>>>>>>
>> >> >> > >> > > > > > > >> >>>>>>>> Hi Tison,
>> >> >> > >> > > > > > > >> >>>>>>>>
>> >> >> > >> > > > > > > >> >>>>>>>> Yes, you are right. I think I made the
>> wrong
>> >> >> > >> argument
>> >> >> > >> > > in
>> >> >> > >> > > > > the
>> >> >> > >> > > > > > > doc.
>> >> >> > >> > > > > > > >> >>>>>>>> Basically, the packaging jar problem
>> is only for
>> >> >> > >> > > platform
>> >> >> > >> > > > > > > users.
>> >> >> > >> > > > > > > >> >>> In
>> >> >> > >> > > > > > > >> >>>>> our
>> >> >> > >> > > > > > > >> >>>>>>>> internal deploy service,
>> >> >> > >> > > > > > > >> >>>>>>>> we further optimized the deployment
>> latency by
>> >> >> > >> letting
>> >> >> > >> > > > > users
>> >> >> > >> > > > > > to
>> >> >> > >> > > > > > > >> >>>>>> packaging
>> >> >> > >> > > > > > > >> >>>>>>>> flink-runtime together with the uber
>> jar, so that
>> >> >> > >> we
>> >> >> > >> > > > don't
>> >> >> > >> > > > > > need
>> >> >> > >> > > > > > > >> to
>> >> >> > >> > > > > > > >> >>>>>>>> consider
>> >> >> > >> > > > > > > >> >>>>>>>> multiple flink version
>> >> >> > >> > > > > > > >> >>>>>>>> support for now. In the session client
>> mode, as
>> >> >> > >> Flink
>> >> >> > >> > > > libs
>> >> >> > >> > > > > > will
>> >> >> > >> > > > > > > >> be
>> >> >> > >> > > > > > > >> >>>>>> shipped
>> >> >> > >> > > > > > > >> >>>>>>>> anyway as local resources of yarn.
>> Users actually
>> >> >> > >> don't
>> >> >> > >> > > > > need
>> >> >> > >> > > > > > to
>> >> >> > >> > > > > > > >> >>>>> package
>> >> >> > >> > > > > > > >> >>>>>>>> those libs into job jar.
>> >> >> > >> > > > > > > >> >>>>>>>>
>> >> >> > >> > > > > > > >> >>>>>>>>
>> >> >> > >> > > > > > > >> >>>>>>>>
>> >> >> > >> > > > > > > >> >>>>>>>> Best Regards
>> >> >> > >> > > > > > > >> >>>>>>>> Peter Huang
>> >> >> > >> > > > > > > >> >>>>>>>>
>> >> >> > >> > > > > > > >> >>>>>>>> On Mon, Dec 9, 2019 at 8:35 PM tison <
>> >> >> > >> > > > [hidden email]
>> >> >> > >> > > > > >
>> >> >> > >> > > > > > > >> >>> wrote:
>> >> >> > >> > > > > > > >> >>>>>>>>
>> >> >> > >> > > > > > > >> >>>>>>>>>> 3. What do you mean about the
>> package? Do users
>> >> >> > >> need
>> >> >> > >> > > to
>> >> >> > >> > > > > > > >> >>> compile
>> >> >> > >> > > > > > > >> >>>>>> their
>> >> >> > >> > > > > > > >> >>>>>>>>> jars
>> >> >> > >> > > > > > > >> >>>>>>>>> inlcuding flink-clients,
>> flink-optimizer,
>> >> >> > >> flink-table
>> >> >> > >> > > > > codes?
>> >> >> > >> > > > > > > >> >>>>>>>>>
>> >> >> > >> > > > > > > >> >>>>>>>>> The answer should be no because they
>> exist in
>> >> >> > >> system
>> >> >> > >> > > > > > > classpath.
>> >> >> > >> > > > > > > >> >>>>>>>>>
>> >> >> > >> > > > > > > >> >>>>>>>>> Best,
>> >> >> > >> > > > > > > >> >>>>>>>>> tison.
>> >> >> > >> > > > > > > >> >>>>>>>>>
>> >> >> > >> > > > > > > >> >>>>>>>>>
>> >> >> > >> > > > > > > >> >>>>>>>>> Yang Wang <[hidden email]>
>> 于2019年12月10日周二
>> >> >> > >> > > > > 下午12:18写道:
>> >> >> > >> > > > > > > >> >>>>>>>>>
>> >> >> > >> > > > > > > >> >>>>>>>>>> Hi Peter,
>> >> >> > >> > > > > > > >> >>>>>>>>>>
>> >> >> > >> > > > > > > >> >>>>>>>>>> Thanks a lot for starting this
>> discussion. I
>> >> >> > >> think
>> >> >> > >> > > this
>> >> >> > >> > > > > is
>> >> >> > >> > > > > > a
>> >> >> > >> > > > > > > >> >>> very
>> >> >> > >> > > > > > > >> >>>>>>>> useful
>> >> >> > >> > > > > > > >> >>>>>>>>>> feature.
>> >> >> > >> > > > > > > >> >>>>>>>>>>
>> >> >> > >> > > > > > > >> >>>>>>>>>> Not only for Yarn, i am focused on
>> flink on
>> >> >> > >> > > Kubernetes
>> >> >> > >> > > > > > > >> >>>>> integration
>> >> >> > >> > > > > > > >> >>>>>> and
>> >> >> > >> > > > > > > >> >>>>>>>>> come
>> >> >> > >> > > > > > > >> >>>>>>>>>> across the same
>> >> >> > >> > > > > > > >> >>>>>>>>>> problem. I do not want the job graph
>> generated
>> >> >> > >> on
>> >> >> > >> > > > client
>> >> >> > >> > > > > > > side.
>> >> >> > >> > > > > > > >> >>>>>>>> Instead,
>> >> >> > >> > > > > > > >> >>>>>>>>> the
>> >> >> > >> > > > > > > >> >>>>>>>>>> user jars are built in
>> >> >> > >> > > > > > > >> >>>>>>>>>> a user-defined image. When the job
>> manager
>> >> >> > >> launched,
>> >> >> > >> > > we
>> >> >> > >> > > > > > just
>> >> >> > >> > > > > > > >> >>>>> need to
>> >> >> > >> > > > > > > >> >>>>>>>>>> generate the job graph
>> >> >> > >> > > > > > > >> >>>>>>>>>> based on local user jars.
>> >> >> > >> > > > > > > >> >>>>>>>>>>
>> >> >> > >> > > > > > > >> >>>>>>>>>> I have some small suggestion about
>> this.
>> >> >> > >> > > > > > > >> >>>>>>>>>>
>> >> >> > >> > > > > > > >> >>>>>>>>>> 1. `ProgramJobGraphRetriever` is
>> very similar to
>> >> >> > >> > > > > > > >> >>>>>>>>>> `ClasspathJobGraphRetriever`, the
>> differences
>> >> >> > >> > > > > > > >> >>>>>>>>>> are the former needs
>> `ProgramMetadata` and the
>> >> >> > >> latter
>> >> >> > >> > > > > needs
>> >> >> > >> > > > > > > >> >>> some
>> >> >> > >> > > > > > > >> >>>>>>>>> arguments.
>> >> >> > >> > > > > > > >> >>>>>>>>>> Is it possible to
>> >> >> > >> > > > > > > >> >>>>>>>>>> have an unified `JobGraphRetriever`
>> to support
>> >> >> > >> both?
>> >> >> > >> > > > > > > >> >>>>>>>>>> 2. Is it possible to not use a local
>> user jar to
>> >> >> > >> > > start
>> >> >> > >> > > > a
>> >> >> > >> > > > > > > >> >>> per-job
>> >> >> > >> > > > > > > >> >>>>>>>> cluster?
>> >> >> > >> > > > > > > >> >>>>>>>>>> In your case, the user jars has
>> >> >> > >> > > > > > > >> >>>>>>>>>> existed on hdfs already and we do
>> need to
>> >> >> > >> download
>> >> >> > >> > > the
>> >> >> > >> > > > > jars
>> >> >> > >> > > > > > > to
>> >> >> > >> > > > > > > >> >>>>>>>> deployer
>> >> >> > >> > > > > > > >> >>>>>>>>>> service. Currently, we
>> >> >> > >> > > > > > > >> >>>>>>>>>> always need a local user jar to
>> start a flink
>> >> >> > >> > > cluster.
>> >> >> > >> > > > It
>> >> >> > >> > > > > > is
>> >> >> > >> > > > > > > >> >>> be
>> >> >> > >> > > > > > > >> >>>>>> great
>> >> >> > >> > > > > > > >> >>>>>>>> if
>> >> >> > >> > > > > > > >> >>>>>>>>> we
>> >> >> > >> > > > > > > >> >>>>>>>>>> could support remote user jars.
>> >> >> > >> > > > > > > >> >>>>>>>>>>>> In the implementation, we assume
>> users package
>> >> >> > >> > > > > > > >> >>> flink-clients,
>> >> >> > >> > > > > > > >> >>>>>>>>>> flink-optimizer, flink-table
>> together within
>> >> >> > >> the job
>> >> >> > >> > > > jar.
>> >> >> > >> > > > > > > >> >>>>> Otherwise,
>> >> >> > >> > > > > > > >> >>>>>>>> the
>> >> >> > >> > > > > > > >> >>>>>>>>>> job graph generation within
>> >> >> > >> JobClusterEntryPoint will
>> >> >> > >> > > > > fail.
>> >> >> > >> > > > > > > >> >>>>>>>>>> 3. What do you mean about the
>> package? Do users
>> >> >> > >> need
>> >> >> > >> > > to
>> >> >> > >> > > > > > > >> >>> compile
>> >> >> > >> > > > > > > >> >>>>>> their
>> >> >> > >> > > > > > > >> >>>>>>>>> jars
>> >> >> > >> > > > > > > >> >>>>>>>>>> inlcuding flink-clients,
>> flink-optimizer,
>> >> >> > >> flink-table
>> >> >> > >> > > > > > codes?
>> >> >> > >> > > > > > > >> >>>>>>>>>>
>> >> >> > >> > > > > > > >> >>>>>>>>>>
>> >> >> > >> > > > > > > >> >>>>>>>>>>
>> >> >> > >> > > > > > > >> >>>>>>>>>> Best,
>> >> >> > >> > > > > > > >> >>>>>>>>>> Yang
>> >> >> > >> > > > > > > >> >>>>>>>>>>
>> >> >> > >> > > > > > > >> >>>>>>>>>> Peter Huang <
>> [hidden email]>
>> >> >> > >> > > > 于2019年12月10日周二
>> >> >> > >> > > > > > > >> >>>>> 上午2:37写道:
>> >> >> > >> > > > > > > >> >>>>>>>>>>
>> >> >> > >> > > > > > > >> >>>>>>>>>>> Dear All,
>> >> >> > >> > > > > > > >> >>>>>>>>>>>
>> >> >> > >> > > > > > > >> >>>>>>>>>>> Recently, the Flink community
>> starts to
>> >> >> > >> improve the
>> >> >> > >> > > > yarn
>> >> >> > >> > > > > > > >> >>>>> cluster
>> >> >> > >> > > > > > > >> >>>>>>>>>> descriptor
>> >> >> > >> > > > > > > >> >>>>>>>>>>> to make job jar and config files
>> configurable
>> >> >> > >> from
>> >> >> > >> > > > CLI.
>> >> >> > >> > > > > It
>> >> >> > >> > > > > > > >> >>>>>> improves
>> >> >> > >> > > > > > > >> >>>>>>>> the
>> >> >> > >> > > > > > > >> >>>>>>>>>>> flexibility of  Flink deployment
>> Yarn Per Job
>> >> >> > >> Mode.
>> >> >> > >> > > > For
>> >> >> > >> > > > > > > >> >>>>> platform
>> >> >> > >> > > > > > > >> >>>>>>>> users
>> >> >> > >> > > > > > > >> >>>>>>>>>> who
>> >> >> > >> > > > > > > >> >>>>>>>>>>> manage tens of hundreds of
>> streaming pipelines
>> >> >> > >> for
>> >> >> > >> > > the
>> >> >> > >> > > > > > whole
>> >> >> > >> > > > > > > >> >>>>> org
>> >> >> > >> > > > > > > >> >>>>>> or
>> >> >> > >> > > > > > > >> >>>>>>>>>>> company, we found the job graph
>> generation in
>> >> >> > >> > > > > client-side
>> >> >> > >> > > > > > is
>> >> >> > >> > > > > > > >> >>>>>> another
>> >> >> > >> > > > > > > >> >>>>>>>>>>> pinpoint. Thus, we want to propose a
>> >> >> > >> configurable
>> >> >> > >> > > > > feature
>> >> >> > >> > > > > > > >> >>> for
>> >> >> > >> > > > > > > >> >>>>>>>>>>> FlinkYarnSessionCli. The feature
>> can allow
>> >> >> > >> users to
>> >> >> > >> > > > > choose
>> >> >> > >> > > > > > > >> >>> the
>> >> >> > >> > > > > > > >> >>>>> job
>> >> >> > >> > > > > > > >> >>>>>>>>> graph
>> >> >> > >> > > > > > > >> >>>>>>>>>>> generation in Flink
>> ClusterEntryPoint so that
>> >> >> > >> the
>> >> >> > >> > > job
>> >> >> > >> > > > > jar
>> >> >> > >> > > > > > > >> >>>>> doesn't
>> >> >> > >> > > > > > > >> >>>>>>>> need
>> >> >> > >> > > > > > > >> >>>>>>>>> to
>> >> >> > >> > > > > > > >> >>>>>>>>>>> be locally for the job graph
>> generation. The
>> >> >> > >> > > proposal
>> >> >> > >> > > > is
>> >> >> > >> > > > > > > >> >>>>> organized
>> >> >> > >> > > > > > > >> >>>>>>>> as a
>> >> >> > >> > > > > > > >> >>>>>>>>>>> FLIP
>> >> >> > >> > > > > > > >> >>>>>>>>>>>
>> >> >> > >> > > > > > > >> >>>>>>>>>>>
>> >> >> > >> > > > > > > >> >>>>>>>>>>
>> >> >> > >> > > > > > > >> >>>>>>>>>
>> >> >> > >> > > > > > > >> >>>>>>>>
>> >> >> > >> > > > > > > >> >>>>>>
>> >> >> > >> > > > > > > >> >>>>>
>> >> >> > >> > > > > > > >> >>>
>> >> >> > >> > > > > > > >>
>> >> >> > >> > > > > > >
>> >> >> > >> > > > > >
>> >> >> > >> > > > >
>> >> >> > >> > > >
>> >> >> > >> > >
>> >> >> > >>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-85+Delayed+JobGraph+Generation
>> >> >> > >> > > > > > > >> >>>>>>>>>>> .
>> >> >> > >> > > > > > > >> >>>>>>>>>>>
>> >> >> > >> > > > > > > >> >>>>>>>>>>> Any questions and suggestions are
>> welcomed.
>> >> >> > >> Thank
>> >> >> > >> > > you
>> >> >> > >> > > > in
>> >> >> > >> > > > > > > >> >>>>> advance.
>> >> >> > >> > > > > > > >> >>>>>>>>>>>
>> >> >> > >> > > > > > > >> >>>>>>>>>>>
>> >> >> > >> > > > > > > >> >>>>>>>>>>> Best Regards
>> >> >> > >> > > > > > > >> >>>>>>>>>>> Peter Huang
>> >> >> > >> > > > > > > >> >>>>>>>>>>>
>> >> >> > >> > > > > > > >> >>>>>>>>>>
>> >> >> > >> > > > > > > >> >>>>>>>>>
>> >> >> > >> > > > > > > >> >>>>>>>>
>> >> >> > >> > > > > > > >> >>>>>>>
>> >> >> > >> > > > > > > >> >>>>>>
>> >> >> > >> > > > > > > >> >>>>>
>> >> >> > >> > > > > > > >> >>>>
>> >> >> > >> > > > > > > >> >>>
>> >> >> > >> > > > > > > >> >>
>> >> >> > >> > > > > > > >>
>> >> >> > >> > > > > > > >>
>> >> >> > >> > > > > > >
>> >> >> > >> > > > > >
>> >> >> > >> > > > >
>> >> >> > >> > > >
>> >> >> > >> > >
>> >> >> > >>
>> >> >> > >
>> >> >>
>>
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-85: Delayed Job Graph Generation

Yang Wang
Hi Peter,
Really thanks for your response.

Hi all @Kostas Kloudas <[hidden email]> @Zili Chen
<[hidden email]> @Peter Huang <[hidden email]> @Rong Rong
<[hidden email]>
It seems that we have reached an agreement. The “application mode”
is regarded as the enhanced “per-job”. It is
orthogonal with “cluster deploy”. Currently, we bind the “per-job” to
`run-user-main-on-client` and “application mode”
to `run-user-main-on-cluster`.

Do you have other concerns to moving FLIP-85 to voting?


Best,
Yang

Peter Huang <[hidden email]> 于2020年3月5日周四 下午12:48写道:

> Hi Yang and Kostas,
>
> Thanks for the clarification. It makes more sense to me if the long term
> goal is to replace per job mode to application mode
>  in the future (at the time that multiple execute can be supported).
> Before that, It will be better to keep the concept of
>  application mode internally. As Yang suggested, User only need to use a
> `-R/-- remote-deploy` cli option to launch
> a per job cluster with the main function executed in cluster
> entry-point.  +1 for the execution plan.
>
>
>
> Best Regards
> Peter Huang
>
>
>
>
> On Tue, Mar 3, 2020 at 7:11 AM Yang Wang <[hidden email]> wrote:
>
>> Hi Peter,
>>
>> Having the application mode does not mean we will drop the cluster-deploy
>> option. I just want to share some thoughts about “Application Mode”.
>>
>>
>> 1. The application mode could cover the per-job sematic. Its lifecyle is
>> bound
>> to the user `main()`. And all the jobs in the user main will be executed
>> in a same
>> Flink cluster. In first phase of FLIP-85 implementation, running user
>> main on the
>> cluster side could be supported in application mode.
>>
>> 2. Maybe in the future, we also need to support multiple `execute()` on
>> client side
>> in a same Flink cluster. Then the per-job mode will evolve to application
>> mode.
>>
>> 3. From user perspective, only a `-R/-- remote-deploy` cli option is
>> visible. They
>> are not aware of the application mode.
>>
>> 4. In the first phase, the application mode is working as “per-job”(only
>> one job in
>> the user main). We just leave more potential for the future.
>>
>>
>> I am not against with calling it “cluster deploy mode” if you all think
>> it is clearer for users.
>>
>>
>>
>> Best,
>> Yang
>>
>> Kostas Kloudas <[hidden email]> 于2020年3月3日周二 下午6:49写道:
>>
>>> Hi Peter,
>>>
>>> I understand your point. This is why I was also a bit torn about the
>>> name and my proposal was a bit aligned with yours (something along the
>>> lines of "cluster deploy" mode).
>>>
>>> But many of the other participants in the discussion suggested the
>>> "Application Mode". I think that the reasoning is that now the user's
>>> Application is more self-contained.
>>> It will be submitted to the cluster and the user can just disconnect.
>>> In addition, as discussed briefly in the doc, in the future there may
>>> be better support for multi-execute applications which will bring us
>>> one step closer to the true "Application Mode". But this is how I
>>> interpreted their arguments, of course they can also express their
>>> thoughts on the topic :)
>>>
>>> Cheers,
>>> Kostas
>>>
>>> On Mon, Mar 2, 2020 at 6:15 PM Peter Huang <[hidden email]>
>>> wrote:
>>> >
>>> > Hi Kostas,
>>> >
>>> > Thanks for updating the wiki. We have aligned with the implementations
>>> in the doc. But I feel it is still a little bit confusing of the naming
>>> from a user's perspective. It is well known that Flink support per job
>>> cluster and session cluster. The concept is in the layer of how a job is
>>> managed within Flink. The method introduced util now is a kind of mixing
>>> job and session cluster to promising the implementation complexity. We
>>> probably don't need to label it as Application Model as the same layer of
>>> per job cluster and session cluster. Conceptually, I think it is still a
>>> cluster mode implementation for per job cluster.
>>> >
>>> > To minimize the confusion of users, I think it would be better just an
>>> option of per job cluster for each type of cluster manager. How do you
>>> think?
>>> >
>>> >
>>> > Best Regards
>>> > Peter Huang
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > On Mon, Mar 2, 2020 at 7:22 AM Kostas Kloudas <[hidden email]>
>>> wrote:
>>> >>
>>> >> Hi Yang,
>>> >>
>>> >> The difference between per-job and application mode is that, as you
>>> >> described, in the per-job mode the main is executed on the client
>>> >> while in the application mode, the main is executed on the cluster.
>>> >> I do not think we have to offer "application mode" with running the
>>> >> main on the client side as this is exactly what the per-job mode does
>>> >> currently and, as you described also, it would be redundant.
>>> >>
>>> >> Sorry if this was not clear in the document.
>>> >>
>>> >> Cheers,
>>> >> Kostas
>>> >>
>>> >> On Mon, Mar 2, 2020 at 3:17 PM Yang Wang <[hidden email]>
>>> wrote:
>>> >> >
>>> >> > Hi Kostas,
>>> >> >
>>> >> > Thanks a lot for your conclusion and updating the FLIP-85 WIKI.
>>> Currently, i have no more
>>> >> > questions about motivation, approach, fault tolerance and the first
>>> phase implementation.
>>> >> >
>>> >> > I think the new title "Flink Application Mode" makes a lot senses
>>> to me. Especially for the
>>> >> > containerized environment, the cluster deploy option will be very
>>> useful.
>>> >> >
>>> >> > Just one concern, how do we introduce this new application mode to
>>> our users?
>>> >> > Each user program(i.e. `main()`) is an application. Currently, we
>>> intend to only support one
>>> >> > `execute()`. So what's the difference between per-job and
>>> application mode?
>>> >> >
>>> >> > For per-job, user `main()` is always executed on client side. And
>>> For application mode, user
>>> >> > `main()` could be executed on client or master side(configured via
>>> cli option).
>>> >> > Right? We need to have a clear concept. Otherwise, the users will
>>> be more and more confusing.
>>> >> >
>>> >> >
>>> >> > Best,
>>> >> > Yang
>>> >> >
>>> >> > Kostas Kloudas <[hidden email]> 于2020年3月2日周一 下午5:58写道:
>>> >> >>
>>> >> >> Hi all,
>>> >> >>
>>> >> >> I update
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-85+Flink+Application+Mode
>>> >> >> based on the discussion we had here:
>>> >> >>
>>> >> >>
>>> https://docs.google.com/document/d/1ji72s3FD9DYUyGuKnJoO4ApzV-nSsZa0-bceGXW7Ocw/edit#
>>> >> >>
>>> >> >> Please let me know what you think and please keep the discussion
>>> in the ML :)
>>> >> >>
>>> >> >> Thanks for starting the discussion and I hope that soon we will be
>>> >> >> able to vote on the FLIP.
>>> >> >>
>>> >> >> Cheers,
>>> >> >> Kostas
>>> >> >>
>>> >> >> On Thu, Jan 16, 2020 at 3:40 AM Yang Wang <[hidden email]>
>>> wrote:
>>> >> >> >
>>> >> >> > Hi all,
>>> >> >> >
>>> >> >> > Thanks a lot for the feedback from @Kostas Kloudas. Your all
>>> concerns are
>>> >> >> > on point. The FLIP-85 is mainly
>>> >> >> > focused on supporting cluster mode for per-job. Since it is more
>>> urgent and
>>> >> >> > have much more use
>>> >> >> > cases both in Yarn and Kubernetes deployment. For session
>>> cluster, we could
>>> >> >> > have more discussion
>>> >> >> > in a new thread later.
>>> >> >> >
>>> >> >> > #1, How to download the user jars and dependencies for per-job
>>> in cluster
>>> >> >> > mode?
>>> >> >> > For Yarn, we could register the user jars and dependencies as
>>> >> >> > LocalResource. They will be distributed
>>> >> >> > by Yarn. And once the JobManager and TaskManager launched, the
>>> jars are
>>> >> >> > already exists.
>>> >> >> > For Standalone per-job and K8s, we expect that the user jars
>>> >> >> > and dependencies are built into the image.
>>> >> >> > Or the InitContainer could be used for downloading. It is
>>> natively
>>> >> >> > distributed and we will not have bottleneck.
>>> >> >> >
>>> >> >> > #2, Job graph recovery
>>> >> >> > We could have an optimization to store job graph on the DFS.
>>> However, i
>>> >> >> > suggest building a new jobgraph
>>> >> >> > from the configuration is the default option. Since we will not
>>> always have
>>> >> >> > a DFS store when deploying a
>>> >> >> > Flink per-job cluster. Of course, we assume that using the same
>>> >> >> > configuration(e.g. job_id, user_jar, main_class,
>>> >> >> > main_args, parallelism, savepoint_settings, etc.) will get a
>>> same job
>>> >> >> > graph. I think the standalone per-job
>>> >> >> > already has the similar behavior.
>>> >> >> >
>>> >> >> > #3, What happens with jobs that have multiple execute calls?
>>> >> >> > Currently, it is really a problem. Even we use a local client on
>>> Flink
>>> >> >> > master side, it will have different behavior with
>>> >> >> > client mode. For client mode, if we execute multiple times, then
>>> we will
>>> >> >> > deploy multiple Flink clusters for each execute.
>>> >> >> > I am not pretty sure whether it is reasonable. However, i still
>>> think using
>>> >> >> > the local client is a good choice. We could
>>> >> >> > continue the discussion in a new thread. @Zili Chen <
>>> [hidden email]> Do
>>> >> >> > you want to drive this?
>>> >> >> >
>>> >> >> >
>>> >> >> >
>>> >> >> > Best,
>>> >> >> > Yang
>>> >> >> >
>>> >> >> > Peter Huang <[hidden email]> 于2020年1月16日周四 上午1:55写道:
>>> >> >> >
>>> >> >> > > Hi Kostas,
>>> >> >> > >
>>> >> >> > > Thanks for this feedback. I can't agree more about the
>>> opinion. The
>>> >> >> > > cluster mode should be added
>>> >> >> > > first in per job cluster.
>>> >> >> > >
>>> >> >> > > 1) For job cluster implementation
>>> >> >> > > 1. Job graph recovery from configuration or store as static
>>> job graph as
>>> >> >> > > session cluster. I think the static one will be better for
>>> less recovery
>>> >> >> > > time.
>>> >> >> > > Let me update the doc for details.
>>> >> >> > >
>>> >> >> > > 2. For job execute multiple times, I think @Zili Chen
>>> >> >> > > <[hidden email]> has proposed the local client solution
>>> that can
>>> >> >> > > the run program actually in the cluster entry point. We can
>>> put the
>>> >> >> > > implementation in the second stage,
>>> >> >> > > or even a new FLIP for further discussion.
>>> >> >> > >
>>> >> >> > > 2) For session cluster implementation
>>> >> >> > > We can disable the cluster mode for the session cluster in the
>>> first
>>> >> >> > > stage. I agree the jar downloading will be a painful thing.
>>> >> >> > > We can consider about PoC and performance evaluation first. If
>>> the end to
>>> >> >> > > end experience is good enough, then we can consider
>>> >> >> > > proceeding with the solution.
>>> >> >> > >
>>> >> >> > > Looking forward to more opinions from @Yang Wang <
>>> [hidden email]> @Zili
>>> >> >> > > Chen <[hidden email]> @Dian Fu <[hidden email]>.
>>> >> >> > >
>>> >> >> > >
>>> >> >> > > Best Regards
>>> >> >> > > Peter Huang
>>> >> >> > >
>>> >> >> > > On Wed, Jan 15, 2020 at 7:50 AM Kostas Kloudas <
>>> [hidden email]> wrote:
>>> >> >> > >
>>> >> >> > >> Hi all,
>>> >> >> > >>
>>> >> >> > >> I am writing here as the discussion on the Google Doc seems
>>> to be a
>>> >> >> > >> bit difficult to follow.
>>> >> >> > >>
>>> >> >> > >> I think that in order to be able to make progress, it would
>>> be helpful
>>> >> >> > >> to focus on per-job mode for now.
>>> >> >> > >> The reason is that:
>>> >> >> > >>  1) making the (unique) JobSubmitHandler responsible for
>>> creating the
>>> >> >> > >> jobgraphs,
>>> >> >> > >>   which includes downloading dependencies, is not an optimal
>>> solution
>>> >> >> > >>  2) even if we put the responsibility on the JobMaster,
>>> currently each
>>> >> >> > >> job has its own
>>> >> >> > >>   JobMaster but they all run on the same process, so we have
>>> again a
>>> >> >> > >> single entity.
>>> >> >> > >>
>>> >> >> > >> Of course after this is done, and if we feel comfortable with
>>> the
>>> >> >> > >> solution, then we can go to the session mode.
>>> >> >> > >>
>>> >> >> > >> A second comment has to do with fault-tolerance in the
>>> per-job,
>>> >> >> > >> cluster-deploy mode.
>>> >> >> > >> In the document, it is suggested that upon recovery, the
>>> JobMaster of
>>> >> >> > >> each job re-creates the JobGraph.
>>> >> >> > >> I am just wondering if it is better to create and store the
>>> jobGraph
>>> >> >> > >> upon submission and only fetch it
>>> >> >> > >> upon recovery so that we have a static jobGraph.
>>> >> >> > >>
>>> >> >> > >> Finally, I have a question which is what happens with jobs
>>> that have
>>> >> >> > >> multiple execute calls?
>>> >> >> > >> The semantics seem to change compared to the current
>>> behaviour, right?
>>> >> >> > >>
>>> >> >> > >> Cheers,
>>> >> >> > >> Kostas
>>> >> >> > >>
>>> >> >> > >> On Wed, Jan 8, 2020 at 8:05 PM tison <[hidden email]>
>>> wrote:
>>> >> >> > >> >
>>> >> >> > >> > not always, Yang Wang is also not yet a committer but he
>>> can join the
>>> >> >> > >> > channel. I cannot find the id by clicking “Add new member
>>> in channel” so
>>> >> >> > >> > come to you and ask for try out the link. Possibly I will
>>> find other
>>> >> >> > >> ways
>>> >> >> > >> > but the original purpose is that the slack channel is a
>>> public area we
>>> >> >> > >> > discuss about developing...
>>> >> >> > >> > Best,
>>> >> >> > >> > tison.
>>> >> >> > >> >
>>> >> >> > >> >
>>> >> >> > >> > Peter Huang <[hidden email]> 于2020年1月9日周四
>>> 上午2:44写道:
>>> >> >> > >> >
>>> >> >> > >> > > Hi Tison,
>>> >> >> > >> > >
>>> >> >> > >> > > I am not the committer of Flink yet. I think I can't join
>>> it also.
>>> >> >> > >> > >
>>> >> >> > >> > >
>>> >> >> > >> > > Best Regards
>>> >> >> > >> > > Peter Huang
>>> >> >> > >> > >
>>> >> >> > >> > > On Wed, Jan 8, 2020 at 9:39 AM tison <
>>> [hidden email]> wrote:
>>> >> >> > >> > >
>>> >> >> > >> > > > Hi Peter,
>>> >> >> > >> > > >
>>> >> >> > >> > > > Could you try out this link?
>>> >> >> > >> > > https://the-asf.slack.com/messages/CNA3ADZPH
>>> >> >> > >> > > >
>>> >> >> > >> > > > Best,
>>> >> >> > >> > > > tison.
>>> >> >> > >> > > >
>>> >> >> > >> > > >
>>> >> >> > >> > > > Peter Huang <[hidden email]> 于2020年1月9日周四
>>> 上午1:22写道:
>>> >> >> > >> > > >
>>> >> >> > >> > > > > Hi Tison,
>>> >> >> > >> > > > >
>>> >> >> > >> > > > > I can't join the group with shared link. Would you
>>> please add me
>>> >> >> > >> into
>>> >> >> > >> > > the
>>> >> >> > >> > > > > group? My slack account is huangzhenqiu0825.
>>> >> >> > >> > > > > Thank you in advance.
>>> >> >> > >> > > > >
>>> >> >> > >> > > > >
>>> >> >> > >> > > > > Best Regards
>>> >> >> > >> > > > > Peter Huang
>>> >> >> > >> > > > >
>>> >> >> > >> > > > > On Wed, Jan 8, 2020 at 12:02 AM tison <
>>> [hidden email]>
>>> >> >> > >> wrote:
>>> >> >> > >> > > > >
>>> >> >> > >> > > > > > Hi Peter,
>>> >> >> > >> > > > > >
>>> >> >> > >> > > > > > As described above, this effort should get
>>> attention from people
>>> >> >> > >> > > > > developing
>>> >> >> > >> > > > > > FLIP-73 a.k.a. Executor abstractions. I recommend
>>> you to join
>>> >> >> > >> the
>>> >> >> > >> > > > public
>>> >> >> > >> > > > > > slack channel[1] for Flink Client API Enhancement
>>> and you can
>>> >> >> > >> try to
>>> >> >> > >> > > > > share
>>> >> >> > >> > > > > > you detailed thoughts there. It possibly gets more
>>> concrete
>>> >> >> > >> > > attentions.
>>> >> >> > >> > > > > >
>>> >> >> > >> > > > > > Best,
>>> >> >> > >> > > > > > tison.
>>> >> >> > >> > > > > >
>>> >> >> > >> > > > > > [1]
>>> >> >> > >> > > > > >
>>> >> >> > >> > > > > >
>>> >> >> > >> > > > >
>>> >> >> > >> > > >
>>> >> >> > >> > >
>>> >> >> > >>
>>> https://slack.com/share/IS21SJ75H/Rk8HhUly9FuEHb7oGwBZ33uL/enQtODg2MDYwNjE5MTg3LTA2MjIzNDc1M2ZjZDVlMjdlZjk1M2RkYmJhNjAwMTk2ZDZkODQ4NmY5YmI4OGRhNWJkYTViMTM1NzlmMzc4OWM
>>> >> >> > >> > > > > >
>>> >> >> > >> > > > > >
>>> >> >> > >> > > > > > Peter Huang <[hidden email]>
>>> 于2020年1月7日周二 上午5:09写道:
>>> >> >> > >> > > > > >
>>> >> >> > >> > > > > > > Dear All,
>>> >> >> > >> > > > > > >
>>> >> >> > >> > > > > > > Happy new year! According to existing feedback
>>> from the
>>> >> >> > >> community,
>>> >> >> > >> > > we
>>> >> >> > >> > > > > > > revised the doc with the consideration of session
>>> cluster
>>> >> >> > >> support,
>>> >> >> > >> > > > and
>>> >> >> > >> > > > > > > concrete interface changes needed and execution
>>> plan. Please
>>> >> >> > >> take
>>> >> >> > >> > > one
>>> >> >> > >> > > > > > more
>>> >> >> > >> > > > > > > round of review at your most convenient time.
>>> >> >> > >> > > > > > >
>>> >> >> > >> > > > > > >
>>> >> >> > >> > > > > > >
>>> >> >> > >> > > > > >
>>> >> >> > >> > > > >
>>> >> >> > >> > > >
>>> >> >> > >> > >
>>> >> >> > >>
>>> https://docs.google.com/document/d/1aAwVjdZByA-0CHbgv16Me-vjaaDMCfhX7TzVVTuifYM/edit#
>>> >> >> > >> > > > > > >
>>> >> >> > >> > > > > > >
>>> >> >> > >> > > > > > > Best Regards
>>> >> >> > >> > > > > > > Peter Huang
>>> >> >> > >> > > > > > >
>>> >> >> > >> > > > > > >
>>> >> >> > >> > > > > > >
>>> >> >> > >> > > > > > >
>>> >> >> > >> > > > > > >
>>> >> >> > >> > > > > > > On Thu, Jan 2, 2020 at 11:29 AM Peter Huang <
>>> >> >> > >> > > > > [hidden email]>
>>> >> >> > >> > > > > > > wrote:
>>> >> >> > >> > > > > > >
>>> >> >> > >> > > > > > > > Hi Dian,
>>> >> >> > >> > > > > > > > Thanks for giving us valuable feedbacks.
>>> >> >> > >> > > > > > > >
>>> >> >> > >> > > > > > > > 1) It's better to have a whole design for this
>>> feature
>>> >> >> > >> > > > > > > > For the suggestion of enabling the cluster mode
>>> also session
>>> >> >> > >> > > > > cluster, I
>>> >> >> > >> > > > > > > > think Flink already supported it.
>>> WebSubmissionExtension
>>> >> >> > >> already
>>> >> >> > >> > > > > allows
>>> >> >> > >> > > > > > > > users to start a job with the specified jar by
>>> using web UI.
>>> >> >> > >> > > > > > > > But we need to enable the feature from CLI for
>>> both local
>>> >> >> > >> jar,
>>> >> >> > >> > > > remote
>>> >> >> > >> > > > > > > jar.
>>> >> >> > >> > > > > > > > I will align with Yang Wang first about the
>>> details and
>>> >> >> > >> update
>>> >> >> > >> > > the
>>> >> >> > >> > > > > > design
>>> >> >> > >> > > > > > > > doc.
>>> >> >> > >> > > > > > > >
>>> >> >> > >> > > > > > > > 2) It's better to consider the convenience for
>>> users, such
>>> >> >> > >> as
>>> >> >> > >> > > > > debugging
>>> >> >> > >> > > > > > > >
>>> >> >> > >> > > > > > > > I am wondering whether we can store the
>>> exception in
>>> >> >> > >> jobgragh
>>> >> >> > >> > > > > > > > generation in application master. As no
>>> streaming graph can
>>> >> >> > >> be
>>> >> >> > >> > > > > > scheduled
>>> >> >> > >> > > > > > > in
>>> >> >> > >> > > > > > > > this case, there will be no more TM will be
>>> requested from
>>> >> >> > >> > > FlinkRM.
>>> >> >> > >> > > > > > > > If the AM is still running, users can still
>>> query it from
>>> >> >> > >> CLI. As
>>> >> >> > >> > > > it
>>> >> >> > >> > > > > > > > requires more change, we can get some feedback
>>> from <
>>> >> >> > >> > > > > > [hidden email]
>>> >> >> > >> > > > > > > >
>>> >> >> > >> > > > > > > > and @[hidden email] <[hidden email]>.
>>> >> >> > >> > > > > > > >
>>> >> >> > >> > > > > > > > 3) It's better to consider the impact to the
>>> stability of
>>> >> >> > >> the
>>> >> >> > >> > > > cluster
>>> >> >> > >> > > > > > > >
>>> >> >> > >> > > > > > > > I agree with Yang Wang's opinion.
>>> >> >> > >> > > > > > > >
>>> >> >> > >> > > > > > > >
>>> >> >> > >> > > > > > > >
>>> >> >> > >> > > > > > > > Best Regards
>>> >> >> > >> > > > > > > > Peter Huang
>>> >> >> > >> > > > > > > >
>>> >> >> > >> > > > > > > >
>>> >> >> > >> > > > > > > > On Sun, Dec 29, 2019 at 9:44 PM Dian Fu <
>>> >> >> > >> [hidden email]>
>>> >> >> > >> > > > > wrote:
>>> >> >> > >> > > > > > > >
>>> >> >> > >> > > > > > > >> Hi all,
>>> >> >> > >> > > > > > > >>
>>> >> >> > >> > > > > > > >> Sorry to jump into this discussion. Thanks
>>> everyone for the
>>> >> >> > >> > > > > > discussion.
>>> >> >> > >> > > > > > > >> I'm very interested in this topic although I'm
>>> not an
>>> >> >> > >> expert in
>>> >> >> > >> > > > this
>>> >> >> > >> > > > > > > part.
>>> >> >> > >> > > > > > > >> So I'm glad to share my thoughts as following:
>>> >> >> > >> > > > > > > >>
>>> >> >> > >> > > > > > > >> 1) It's better to have a whole design for this
>>> feature
>>> >> >> > >> > > > > > > >> As we know, there are two deployment modes:
>>> per-job mode
>>> >> >> > >> and
>>> >> >> > >> > > > session
>>> >> >> > >> > > > > > > >> mode. I'm wondering which mode really needs
>>> this feature.
>>> >> >> > >> As the
>>> >> >> > >> > > > > > design
>>> >> >> > >> > > > > > > doc
>>> >> >> > >> > > > > > > >> mentioned, per-job mode is more used for
>>> streaming jobs and
>>> >> >> > >> > > > session
>>> >> >> > >> > > > > > > mode is
>>> >> >> > >> > > > > > > >> usually used for batch jobs(Of course, the job
>>> types and
>>> >> >> > >> the
>>> >> >> > >> > > > > > deployment
>>> >> >> > >> > > > > > > >> modes are orthogonal). Usually streaming job
>>> is only
>>> >> >> > >> needed to
>>> >> >> > >> > > be
>>> >> >> > >> > > > > > > submitted
>>> >> >> > >> > > > > > > >> once and it will run for days or weeks, while
>>> batch jobs
>>> >> >> > >> will be
>>> >> >> > >> > > > > > > submitted
>>> >> >> > >> > > > > > > >> more frequently compared with streaming jobs.
>>> This means
>>> >> >> > >> that
>>> >> >> > >> > > > maybe
>>> >> >> > >> > > > > > > session
>>> >> >> > >> > > > > > > >> mode also needs this feature. However, if we
>>> support this
>>> >> >> > >> > > feature
>>> >> >> > >> > > > in
>>> >> >> > >> > > > > > > >> session mode, the application master will
>>> become the new
>>> >> >> > >> > > > centralized
>>> >> >> > >> > > > > > > >> service(which should be solved). So in this
>>> case, it's
>>> >> >> > >> better to
>>> >> >> > >> > > > > have
>>> >> >> > >> > > > > > a
>>> >> >> > >> > > > > > > >> complete design for both per-job mode and
>>> session mode.
>>> >> >> > >> > > > Furthermore,
>>> >> >> > >> > > > > > > even
>>> >> >> > >> > > > > > > >> if we can do it phase by phase, we need to
>>> have a whole
>>> >> >> > >> picture
>>> >> >> > >> > > of
>>> >> >> > >> > > > > how
>>> >> >> > >> > > > > > > it
>>> >> >> > >> > > > > > > >> works in both per-job mode and session mode.
>>> >> >> > >> > > > > > > >>
>>> >> >> > >> > > > > > > >> 2) It's better to consider the convenience for
>>> users, such
>>> >> >> > >> as
>>> >> >> > >> > > > > > debugging
>>> >> >> > >> > > > > > > >> After we finish this feature, the job graph
>>> will be
>>> >> >> > >> compiled in
>>> >> >> > >> > > > the
>>> >> >> > >> > > > > > > >> application master, which means that users
>>> cannot easily
>>> >> >> > >> get the
>>> >> >> > >> > > > > > > exception
>>> >> >> > >> > > > > > > >> message synchorousely in the job client if
>>> there are
>>> >> >> > >> problems
>>> >> >> > >> > > > during
>>> >> >> > >> > > > > > the
>>> >> >> > >> > > > > > > >> job graph compiling (especially for platform
>>> users), such
>>> >> >> > >> as the
>>> >> >> > >> > > > > > > resource
>>> >> >> > >> > > > > > > >> path is incorrect, the user program itself has
>>> some
>>> >> >> > >> problems,
>>> >> >> > >> > > etc.
>>> >> >> > >> > > > > > What
>>> >> >> > >> > > > > > > I'm
>>> >> >> > >> > > > > > > >> thinking is that maybe we should throw the
>>> exceptions as
>>> >> >> > >> early
>>> >> >> > >> > > as
>>> >> >> > >> > > > > > > possible
>>> >> >> > >> > > > > > > >> (during job submission stage).
>>> >> >> > >> > > > > > > >>
>>> >> >> > >> > > > > > > >> 3) It's better to consider the impact to the
>>> stability of
>>> >> >> > >> the
>>> >> >> > >> > > > > cluster
>>> >> >> > >> > > > > > > >> If we perform the compiling in the application
>>> master, we
>>> >> >> > >> should
>>> >> >> > >> > > > > > > consider
>>> >> >> > >> > > > > > > >> the impact of the compiling errors. Although
>>> YARN could
>>> >> >> > >> resume
>>> >> >> > >> > > the
>>> >> >> > >> > > > > > > >> application master in case of failures, but in
>>> some case
>>> >> >> > >> the
>>> >> >> > >> > > > > compiling
>>> >> >> > >> > > > > > > >> failure may be a waste of cluster resource and
>>> may impact
>>> >> >> > >> the
>>> >> >> > >> > > > > > stability
>>> >> >> > >> > > > > > > the
>>> >> >> > >> > > > > > > >> cluster and the other jobs in the cluster,
>>> such as the
>>> >> >> > >> resource
>>> >> >> > >> > > > path
>>> >> >> > >> > > > > > is
>>> >> >> > >> > > > > > > >> incorrect, the user program itself has some
>>> problems(in
>>> >> >> > >> this
>>> >> >> > >> > > case,
>>> >> >> > >> > > > > job
>>> >> >> > >> > > > > > > >> failover cannot solve this kind of problems)
>>> etc. In the
>>> >> >> > >> current
>>> >> >> > >> > > > > > > >> implemention, the compiling errors are handled
>>> in the
>>> >> >> > >> client
>>> >> >> > >> > > side
>>> >> >> > >> > > > > and
>>> >> >> > >> > > > > > > there
>>> >> >> > >> > > > > > > >> is no impact to the cluster at all.
>>> >> >> > >> > > > > > > >>
>>> >> >> > >> > > > > > > >> Regarding to 1), it's clearly pointed in the
>>> design doc
>>> >> >> > >> that
>>> >> >> > >> > > only
>>> >> >> > >> > > > > > > per-job
>>> >> >> > >> > > > > > > >> mode will be supported. However, I think it's
>>> better to
>>> >> >> > >> also
>>> >> >> > >> > > > > consider
>>> >> >> > >> > > > > > > the
>>> >> >> > >> > > > > > > >> session mode in the design doc.
>>> >> >> > >> > > > > > > >> Regarding to 2) and 3), I have not seen
>>> related sections
>>> >> >> > >> in the
>>> >> >> > >> > > > > design
>>> >> >> > >> > > > > > > >> doc. It will be good if we can cover them in
>>> the design
>>> >> >> > >> doc.
>>> >> >> > >> > > > > > > >>
>>> >> >> > >> > > > > > > >> Feel free to correct me If there is anything I
>>> >> >> > >> misunderstand.
>>> >> >> > >> > > > > > > >>
>>> >> >> > >> > > > > > > >> Regards,
>>> >> >> > >> > > > > > > >> Dian
>>> >> >> > >> > > > > > > >>
>>> >> >> > >> > > > > > > >>
>>> >> >> > >> > > > > > > >> > 在 2019年12月27日,上午3:13,Peter Huang <
>>> >> >> > >> [hidden email]>
>>> >> >> > >> > > > 写道:
>>> >> >> > >> > > > > > > >> >
>>> >> >> > >> > > > > > > >> > Hi Yang,
>>> >> >> > >> > > > > > > >> >
>>> >> >> > >> > > > > > > >> > I can't agree more. The effort definitely
>>> needs to align
>>> >> >> > >> with
>>> >> >> > >> > > > the
>>> >> >> > >> > > > > > > final
>>> >> >> > >> > > > > > > >> > goal of FLIP-73.
>>> >> >> > >> > > > > > > >> > I am thinking about whether we can achieve
>>> the goal with
>>> >> >> > >> two
>>> >> >> > >> > > > > phases.
>>> >> >> > >> > > > > > > >> >
>>> >> >> > >> > > > > > > >> > 1) Phase I
>>> >> >> > >> > > > > > > >> > As the CLiFrontend will not be depreciated
>>> soon. We can
>>> >> >> > >> still
>>> >> >> > >> > > > use
>>> >> >> > >> > > > > > the
>>> >> >> > >> > > > > > > >> > deployMode flag there,
>>> >> >> > >> > > > > > > >> > pass the program info through Flink
>>> configuration,  use
>>> >> >> > >> the
>>> >> >> > >> > > > > > > >> > ClassPathJobGraphRetriever
>>> >> >> > >> > > > > > > >> > to generate the job graph in
>>> ClusterEntrypoints of yarn
>>> >> >> > >> and
>>> >> >> > >> > > > > > > Kubernetes.
>>> >> >> > >> > > > > > > >> >
>>> >> >> > >> > > > > > > >> > 2) Phase II
>>> >> >> > >> > > > > > > >> > In  AbstractJobClusterExecutor, the job
>>> graph is
>>> >> >> > >> generated in
>>> >> >> > >> > > > the
>>> >> >> > >> > > > > > > >> execute
>>> >> >> > >> > > > > > > >> > function. We can still
>>> >> >> > >> > > > > > > >> > use the deployMode in it. With deployMode =
>>> cluster, the
>>> >> >> > >> > > execute
>>> >> >> > >> > > > > > > >> function
>>> >> >> > >> > > > > > > >> > only starts the cluster.
>>> >> >> > >> > > > > > > >> >
>>> >> >> > >> > > > > > > >> > When
>>> {Yarn/Kuberneates}PerJobClusterEntrypoint starts,
>>> >> >> > >> It will
>>> >> >> > >> > > > > start
>>> >> >> > >> > > > > > > the
>>> >> >> > >> > > > > > > >> > dispatch first, then we can use
>>> >> >> > >> > > > > > > >> > a ClusterEnvironment similar to
>>> ContextEnvironment to
>>> >> >> > >> submit
>>> >> >> > >> > > the
>>> >> >> > >> > > > > job
>>> >> >> > >> > > > > > > >> with
>>> >> >> > >> > > > > > > >> > jobName the local
>>> >> >> > >> > > > > > > >> > dispatcher. For the details, we need more
>>> investigation.
>>> >> >> > >> Let's
>>> >> >> > >> > > > > wait
>>> >> >> > >> > > > > > > >> > for @Aljoscha
>>> >> >> > >> > > > > > > >> > Krettek <[hidden email]> @Till
>>> Rohrmann <
>>> >> >> > >> > > > > [hidden email]
>>> >> >> > >> > > > > > >'s
>>> >> >> > >> > > > > > > >> > feedback after the holiday season.
>>> >> >> > >> > > > > > > >> >
>>> >> >> > >> > > > > > > >> > Thank you in advance. Merry Chrismas and
>>> Happy New
>>> >> >> > >> Year!!!
>>> >> >> > >> > > > > > > >> >
>>> >> >> > >> > > > > > > >> >
>>> >> >> > >> > > > > > > >> > Best Regards
>>> >> >> > >> > > > > > > >> > Peter Huang
>>> >> >> > >> > > > > > > >> >
>>> >> >> > >> > > > > > > >> >
>>> >> >> > >> > > > > > > >> >
>>> >> >> > >> > > > > > > >> >
>>> >> >> > >> > > > > > > >> >
>>> >> >> > >> > > > > > > >> >
>>> >> >> > >> > > > > > > >> >
>>> >> >> > >> > > > > > > >> >
>>> >> >> > >> > > > > > > >> > On Wed, Dec 25, 2019 at 1:08 AM Yang Wang <
>>> >> >> > >> > > > [hidden email]>
>>> >> >> > >> > > > > > > >> wrote:
>>> >> >> > >> > > > > > > >> >
>>> >> >> > >> > > > > > > >> >> Hi Peter,
>>> >> >> > >> > > > > > > >> >>
>>> >> >> > >> > > > > > > >> >> I think we need to reconsider tison's
>>> suggestion
>>> >> >> > >> seriously.
>>> >> >> > >> > > > After
>>> >> >> > >> > > > > > > >> FLIP-73,
>>> >> >> > >> > > > > > > >> >> the deployJobCluster has
>>> >> >> > >> > > > > > > >> >> beenmoved into
>>> `JobClusterExecutor#execute`. It should
>>> >> >> > >> not be
>>> >> >> > >> > > > > > > perceived
>>> >> >> > >> > > > > > > >> >> for `CliFrontend`. That
>>> >> >> > >> > > > > > > >> >> means the user program will *ALWAYS* be
>>> executed on
>>> >> >> > >> client
>>> >> >> > >> > > > side.
>>> >> >> > >> > > > > > This
>>> >> >> > >> > > > > > > >> is
>>> >> >> > >> > > > > > > >> >> the by design behavior.
>>> >> >> > >> > > > > > > >> >> So, we could not just add `if(client mode)
>>> .. else
>>> >> >> > >> if(cluster
>>> >> >> > >> > > > > mode)
>>> >> >> > >> > > > > > > >> ...`
>>> >> >> > >> > > > > > > >> >> codes in `CliFrontend` to bypass
>>> >> >> > >> > > > > > > >> >> the executor. We need to find a clean way
>>> to decouple
>>> >> >> > >> > > executing
>>> >> >> > >> > > > > > user
>>> >> >> > >> > > > > > > >> >> program and deploying per-job
>>> >> >> > >> > > > > > > >> >> cluster. Based on this, we could support to
>>> execute user
>>> >> >> > >> > > > program
>>> >> >> > >> > > > > on
>>> >> >> > >> > > > > > > >> client
>>> >> >> > >> > > > > > > >> >> or master side.
>>> >> >> > >> > > > > > > >> >>
>>> >> >> > >> > > > > > > >> >> Maybe Aljoscha and Jeff could give some good
>>> >> >> > >> suggestions.
>>> >> >> > >> > > > > > > >> >>
>>> >> >> > >> > > > > > > >> >>
>>> >> >> > >> > > > > > > >> >>
>>> >> >> > >> > > > > > > >> >> Best,
>>> >> >> > >> > > > > > > >> >> Yang
>>> >> >> > >> > > > > > > >> >>
>>> >> >> > >> > > > > > > >> >> Peter Huang <[hidden email]>
>>> 于2019年12月25日周三
>>> >> >> > >> > > > > 上午4:03写道:
>>> >> >> > >> > > > > > > >> >>
>>> >> >> > >> > > > > > > >> >>> Hi Jingjing,
>>> >> >> > >> > > > > > > >> >>>
>>> >> >> > >> > > > > > > >> >>> The improvement proposed is a deployment
>>> option for
>>> >> >> > >> CLI. For
>>> >> >> > >> > > > SQL
>>> >> >> > >> > > > > > > based
>>> >> >> > >> > > > > > > >> >>> Flink application, It is more convenient
>>> to use the
>>> >> >> > >> existing
>>> >> >> > >> > > > > model
>>> >> >> > >> > > > > > > in
>>> >> >> > >> > > > > > > >> >>> SqlClient in which
>>> >> >> > >> > > > > > > >> >>> the job graph is generated within
>>> SqlClient. After
>>> >> >> > >> adding
>>> >> >> > >> > > the
>>> >> >> > >> > > > > > > delayed
>>> >> >> > >> > > > > > > >> job
>>> >> >> > >> > > > > > > >> >>> graph generation, I think there is no
>>> change is needed
>>> >> >> > >> for
>>> >> >> > >> > > > your
>>> >> >> > >> > > > > > > side.
>>> >> >> > >> > > > > > > >> >>>
>>> >> >> > >> > > > > > > >> >>>
>>> >> >> > >> > > > > > > >> >>> Best Regards
>>> >> >> > >> > > > > > > >> >>> Peter Huang
>>> >> >> > >> > > > > > > >> >>>
>>> >> >> > >> > > > > > > >> >>>
>>> >> >> > >> > > > > > > >> >>> On Wed, Dec 18, 2019 at 6:01 AM jingjing
>>> bai <
>>> >> >> > >> > > > > > > >> [hidden email]>
>>> >> >> > >> > > > > > > >> >>> wrote:
>>> >> >> > >> > > > > > > >> >>>
>>> >> >> > >> > > > > > > >> >>>> hi peter:
>>> >> >> > >> > > > > > > >> >>>>    we had extension SqlClent to support
>>> sql job
>>> >> >> > >> submit in
>>> >> >> > >> > > web
>>> >> >> > >> > > > > > base
>>> >> >> > >> > > > > > > on
>>> >> >> > >> > > > > > > >> >>>> flink 1.9.   we support submit to yarn on
>>> per job
>>> >> >> > >> mode too.
>>> >> >> > >> > > > > > > >> >>>>    in this case, the job graph generated
>>> on client
>>> >> >> > >> side
>>> >> >> > >> > > .  I
>>> >> >> > >> > > > > > think
>>> >> >> > >> > > > > > > >> >>> this
>>> >> >> > >> > > > > > > >> >>>> discuss Mainly to improve api programme.
>>> but in my
>>> >> >> > >> case ,
>>> >> >> > >> > > > > there
>>> >> >> > >> > > > > > is
>>> >> >> > >> > > > > > > >> no
>>> >> >> > >> > > > > > > >> >>>> jar to upload but only a sql string .
>>> >> >> > >> > > > > > > >> >>>>    do u had more suggestion to improve
>>> for sql mode
>>> >> >> > >> or it
>>> >> >> > >> > > is
>>> >> >> > >> > > > > > only a
>>> >> >> > >> > > > > > > >> >>>> switch for api programme?
>>> >> >> > >> > > > > > > >> >>>>
>>> >> >> > >> > > > > > > >> >>>>
>>> >> >> > >> > > > > > > >> >>>> best
>>> >> >> > >> > > > > > > >> >>>> bai jj
>>> >> >> > >> > > > > > > >> >>>>
>>> >> >> > >> > > > > > > >> >>>>
>>> >> >> > >> > > > > > > >> >>>> Yang Wang <[hidden email]>
>>> 于2019年12月18日周三
>>> >> >> > >> 下午7:21写道:
>>> >> >> > >> > > > > > > >> >>>>
>>> >> >> > >> > > > > > > >> >>>>> I just want to revive this discussion.
>>> >> >> > >> > > > > > > >> >>>>>
>>> >> >> > >> > > > > > > >> >>>>> Recently, i am thinking about how to
>>> natively run
>>> >> >> > >> flink
>>> >> >> > >> > > > > per-job
>>> >> >> > >> > > > > > > >> >>> cluster on
>>> >> >> > >> > > > > > > >> >>>>> Kubernetes.
>>> >> >> > >> > > > > > > >> >>>>> The per-job mode on Kubernetes is very
>>> different
>>> >> >> > >> from on
>>> >> >> > >> > > > Yarn.
>>> >> >> > >> > > > > > And
>>> >> >> > >> > > > > > > >> we
>>> >> >> > >> > > > > > > >> >>> will
>>> >> >> > >> > > > > > > >> >>>>> have
>>> >> >> > >> > > > > > > >> >>>>> the same deployment requirements to the
>>> client and
>>> >> >> > >> entry
>>> >> >> > >> > > > > point.
>>> >> >> > >> > > > > > > >> >>>>>
>>> >> >> > >> > > > > > > >> >>>>> 1. Flink client not always need a local
>>> jar to start
>>> >> >> > >> a
>>> >> >> > >> > > Flink
>>> >> >> > >> > > > > > > per-job
>>> >> >> > >> > > > > > > >> >>>>> cluster. We could
>>> >> >> > >> > > > > > > >> >>>>> support multiple schemas. For example,
>>> >> >> > >> > > > file:///path/of/my.jar
>>> >> >> > >> > > > > > > means
>>> >> >> > >> > > > > > > >> a
>>> >> >> > >> > > > > > > >> >>> jar
>>> >> >> > >> > > > > > > >> >>>>> located
>>> >> >> > >> > > > > > > >> >>>>> at client side,
>>> >> >> > >> hdfs://myhdfs/user/myname/flink/my.jar
>>> >> >> > >> > > > means a
>>> >> >> > >> > > > > > jar
>>> >> >> > >> > > > > > > >> >>> located
>>> >> >> > >> > > > > > > >> >>>>> at
>>> >> >> > >> > > > > > > >> >>>>> remote hdfs,
>>> local:///path/in/image/my.jar means a
>>> >> >> > >> jar
>>> >> >> > >> > > > located
>>> >> >> > >> > > > > > at
>>> >> >> > >> > > > > > > >> >>>>> jobmanager side.
>>> >> >> > >> > > > > > > >> >>>>>
>>> >> >> > >> > > > > > > >> >>>>> 2. Support running user program on
>>> master side. This
>>> >> >> > >> also
>>> >> >> > >> > > > > means
>>> >> >> > >> > > > > > > the
>>> >> >> > >> > > > > > > >> >>> entry
>>> >> >> > >> > > > > > > >> >>>>> point
>>> >> >> > >> > > > > > > >> >>>>> will generate the job graph on master
>>> side. We could
>>> >> >> > >> use
>>> >> >> > >> > > the
>>> >> >> > >> > > > > > > >> >>>>> ClasspathJobGraphRetriever
>>> >> >> > >> > > > > > > >> >>>>> or start a local Flink client to achieve
>>> this
>>> >> >> > >> purpose.
>>> >> >> > >> > > > > > > >> >>>>>
>>> >> >> > >> > > > > > > >> >>>>>
>>> >> >> > >> > > > > > > >> >>>>> cc tison, Aljoscha & Kostas Do you think
>>> this is the
>>> >> >> > >> right
>>> >> >> > >> > > > > > > >> direction we
>>> >> >> > >> > > > > > > >> >>>>> need to work?
>>> >> >> > >> > > > > > > >> >>>>>
>>> >> >> > >> > > > > > > >> >>>>> tison <[hidden email]>
>>> 于2019年12月12日周四
>>> >> >> > >> 下午4:48写道:
>>> >> >> > >> > > > > > > >> >>>>>
>>> >> >> > >> > > > > > > >> >>>>>> A quick idea is that we separate the
>>> deployment
>>> >> >> > >> from user
>>> >> >> > >> > > > > > program
>>> >> >> > >> > > > > > > >> >>> that
>>> >> >> > >> > > > > > > >> >>>>> it
>>> >> >> > >> > > > > > > >> >>>>>> has always been done
>>> >> >> > >> > > > > > > >> >>>>>> outside the program. On user program
>>> executed there
>>> >> >> > >> is
>>> >> >> > >> > > > > always a
>>> >> >> > >> > > > > > > >> >>>>>> ClusterClient that communicates with
>>> >> >> > >> > > > > > > >> >>>>>> an existing cluster, remote or local.
>>> It will be
>>> >> >> > >> another
>>> >> >> > >> > > > > thread
>>> >> >> > >> > > > > > > so
>>> >> >> > >> > > > > > > >> >>> just
>>> >> >> > >> > > > > > > >> >>>>> for
>>> >> >> > >> > > > > > > >> >>>>>> your information.
>>> >> >> > >> > > > > > > >> >>>>>>
>>> >> >> > >> > > > > > > >> >>>>>> Best,
>>> >> >> > >> > > > > > > >> >>>>>> tison.
>>> >> >> > >> > > > > > > >> >>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>
>>> >> >> > >> > > > > > > >> >>>>>> tison <[hidden email]>
>>> 于2019年12月12日周四
>>> >> >> > >> 下午4:40写道:
>>> >> >> > >> > > > > > > >> >>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>> Hi Peter,
>>> >> >> > >> > > > > > > >> >>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>> Another concern I realized recently is
>>> that with
>>> >> >> > >> current
>>> >> >> > >> > > > > > > Executors
>>> >> >> > >> > > > > > > >> >>>>>>> abstraction(FLIP-73)
>>> >> >> > >> > > > > > > >> >>>>>>> I'm afraid that user program is
>>> designed to ALWAYS
>>> >> >> > >> run
>>> >> >> > >> > > on
>>> >> >> > >> > > > > the
>>> >> >> > >> > > > > > > >> >>> client
>>> >> >> > >> > > > > > > >> >>>>>> side.
>>> >> >> > >> > > > > > > >> >>>>>>> Specifically,
>>> >> >> > >> > > > > > > >> >>>>>>> we deploy the job in executor when
>>> env.execute
>>> >> >> > >> called.
>>> >> >> > >> > > > This
>>> >> >> > >> > > > > > > >> >>>>> abstraction
>>> >> >> > >> > > > > > > >> >>>>>>> possibly prevents
>>> >> >> > >> > > > > > > >> >>>>>>> Flink runs user program on the cluster
>>> side.
>>> >> >> > >> > > > > > > >> >>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>> For your proposal, in this case we
>>> already
>>> >> >> > >> compiled the
>>> >> >> > >> > > > > > program
>>> >> >> > >> > > > > > > >> and
>>> >> >> > >> > > > > > > >> >>>>> run
>>> >> >> > >> > > > > > > >> >>>>>> on
>>> >> >> > >> > > > > > > >> >>>>>>> the client side,
>>> >> >> > >> > > > > > > >> >>>>>>> even we deploy a cluster and retrieve
>>> job graph
>>> >> >> > >> from
>>> >> >> > >> > > > program
>>> >> >> > >> > > > > > > >> >>>>> metadata, it
>>> >> >> > >> > > > > > > >> >>>>>>> doesn't make
>>> >> >> > >> > > > > > > >> >>>>>>> many sense.
>>> >> >> > >> > > > > > > >> >>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>> cc Aljoscha & Kostas what do you think
>>> about this
>>> >> >> > >> > > > > constraint?
>>> >> >> > >> > > > > > > >> >>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>> Best,
>>> >> >> > >> > > > > > > >> >>>>>>> tison.
>>> >> >> > >> > > > > > > >> >>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>> Peter Huang <
>>> [hidden email]>
>>> >> >> > >> 于2019年12月10日周二
>>> >> >> > >> > > > > > > >> 下午12:45写道:
>>> >> >> > >> > > > > > > >> >>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>> Hi Tison,
>>> >> >> > >> > > > > > > >> >>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>> Yes, you are right. I think I made
>>> the wrong
>>> >> >> > >> argument
>>> >> >> > >> > > in
>>> >> >> > >> > > > > the
>>> >> >> > >> > > > > > > doc.
>>> >> >> > >> > > > > > > >> >>>>>>>> Basically, the packaging jar problem
>>> is only for
>>> >> >> > >> > > platform
>>> >> >> > >> > > > > > > users.
>>> >> >> > >> > > > > > > >> >>> In
>>> >> >> > >> > > > > > > >> >>>>> our
>>> >> >> > >> > > > > > > >> >>>>>>>> internal deploy service,
>>> >> >> > >> > > > > > > >> >>>>>>>> we further optimized the deployment
>>> latency by
>>> >> >> > >> letting
>>> >> >> > >> > > > > users
>>> >> >> > >> > > > > > to
>>> >> >> > >> > > > > > > >> >>>>>> packaging
>>> >> >> > >> > > > > > > >> >>>>>>>> flink-runtime together with the uber
>>> jar, so that
>>> >> >> > >> we
>>> >> >> > >> > > > don't
>>> >> >> > >> > > > > > need
>>> >> >> > >> > > > > > > >> to
>>> >> >> > >> > > > > > > >> >>>>>>>> consider
>>> >> >> > >> > > > > > > >> >>>>>>>> multiple flink version
>>> >> >> > >> > > > > > > >> >>>>>>>> support for now. In the session
>>> client mode, as
>>> >> >> > >> Flink
>>> >> >> > >> > > > libs
>>> >> >> > >> > > > > > will
>>> >> >> > >> > > > > > > >> be
>>> >> >> > >> > > > > > > >> >>>>>> shipped
>>> >> >> > >> > > > > > > >> >>>>>>>> anyway as local resources of yarn.
>>> Users actually
>>> >> >> > >> don't
>>> >> >> > >> > > > > need
>>> >> >> > >> > > > > > to
>>> >> >> > >> > > > > > > >> >>>>> package
>>> >> >> > >> > > > > > > >> >>>>>>>> those libs into job jar.
>>> >> >> > >> > > > > > > >> >>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>> Best Regards
>>> >> >> > >> > > > > > > >> >>>>>>>> Peter Huang
>>> >> >> > >> > > > > > > >> >>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>> On Mon, Dec 9, 2019 at 8:35 PM tison <
>>> >> >> > >> > > > [hidden email]
>>> >> >> > >> > > > > >
>>> >> >> > >> > > > > > > >> >>> wrote:
>>> >> >> > >> > > > > > > >> >>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>>> 3. What do you mean about the
>>> package? Do users
>>> >> >> > >> need
>>> >> >> > >> > > to
>>> >> >> > >> > > > > > > >> >>> compile
>>> >> >> > >> > > > > > > >> >>>>>> their
>>> >> >> > >> > > > > > > >> >>>>>>>>> jars
>>> >> >> > >> > > > > > > >> >>>>>>>>> inlcuding flink-clients,
>>> flink-optimizer,
>>> >> >> > >> flink-table
>>> >> >> > >> > > > > codes?
>>> >> >> > >> > > > > > > >> >>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>> The answer should be no because they
>>> exist in
>>> >> >> > >> system
>>> >> >> > >> > > > > > > classpath.
>>> >> >> > >> > > > > > > >> >>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>> Best,
>>> >> >> > >> > > > > > > >> >>>>>>>>> tison.
>>> >> >> > >> > > > > > > >> >>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>> Yang Wang <[hidden email]>
>>> 于2019年12月10日周二
>>> >> >> > >> > > > > 下午12:18写道:
>>> >> >> > >> > > > > > > >> >>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>>> Hi Peter,
>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>>> Thanks a lot for starting this
>>> discussion. I
>>> >> >> > >> think
>>> >> >> > >> > > this
>>> >> >> > >> > > > > is
>>> >> >> > >> > > > > > a
>>> >> >> > >> > > > > > > >> >>> very
>>> >> >> > >> > > > > > > >> >>>>>>>> useful
>>> >> >> > >> > > > > > > >> >>>>>>>>>> feature.
>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>>> Not only for Yarn, i am focused on
>>> flink on
>>> >> >> > >> > > Kubernetes
>>> >> >> > >> > > > > > > >> >>>>> integration
>>> >> >> > >> > > > > > > >> >>>>>> and
>>> >> >> > >> > > > > > > >> >>>>>>>>> come
>>> >> >> > >> > > > > > > >> >>>>>>>>>> across the same
>>> >> >> > >> > > > > > > >> >>>>>>>>>> problem. I do not want the job
>>> graph generated
>>> >> >> > >> on
>>> >> >> > >> > > > client
>>> >> >> > >> > > > > > > side.
>>> >> >> > >> > > > > > > >> >>>>>>>> Instead,
>>> >> >> > >> > > > > > > >> >>>>>>>>> the
>>> >> >> > >> > > > > > > >> >>>>>>>>>> user jars are built in
>>> >> >> > >> > > > > > > >> >>>>>>>>>> a user-defined image. When the job
>>> manager
>>> >> >> > >> launched,
>>> >> >> > >> > > we
>>> >> >> > >> > > > > > just
>>> >> >> > >> > > > > > > >> >>>>> need to
>>> >> >> > >> > > > > > > >> >>>>>>>>>> generate the job graph
>>> >> >> > >> > > > > > > >> >>>>>>>>>> based on local user jars.
>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>>> I have some small suggestion about
>>> this.
>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>>> 1. `ProgramJobGraphRetriever` is
>>> very similar to
>>> >> >> > >> > > > > > > >> >>>>>>>>>> `ClasspathJobGraphRetriever`, the
>>> differences
>>> >> >> > >> > > > > > > >> >>>>>>>>>> are the former needs
>>> `ProgramMetadata` and the
>>> >> >> > >> latter
>>> >> >> > >> > > > > needs
>>> >> >> > >> > > > > > > >> >>> some
>>> >> >> > >> > > > > > > >> >>>>>>>>> arguments.
>>> >> >> > >> > > > > > > >> >>>>>>>>>> Is it possible to
>>> >> >> > >> > > > > > > >> >>>>>>>>>> have an unified `JobGraphRetriever`
>>> to support
>>> >> >> > >> both?
>>> >> >> > >> > > > > > > >> >>>>>>>>>> 2. Is it possible to not use a
>>> local user jar to
>>> >> >> > >> > > start
>>> >> >> > >> > > > a
>>> >> >> > >> > > > > > > >> >>> per-job
>>> >> >> > >> > > > > > > >> >>>>>>>> cluster?
>>> >> >> > >> > > > > > > >> >>>>>>>>>> In your case, the user jars has
>>> >> >> > >> > > > > > > >> >>>>>>>>>> existed on hdfs already and we do
>>> need to
>>> >> >> > >> download
>>> >> >> > >> > > the
>>> >> >> > >> > > > > jars
>>> >> >> > >> > > > > > > to
>>> >> >> > >> > > > > > > >> >>>>>>>> deployer
>>> >> >> > >> > > > > > > >> >>>>>>>>>> service. Currently, we
>>> >> >> > >> > > > > > > >> >>>>>>>>>> always need a local user jar to
>>> start a flink
>>> >> >> > >> > > cluster.
>>> >> >> > >> > > > It
>>> >> >> > >> > > > > > is
>>> >> >> > >> > > > > > > >> >>> be
>>> >> >> > >> > > > > > > >> >>>>>> great
>>> >> >> > >> > > > > > > >> >>>>>>>> if
>>> >> >> > >> > > > > > > >> >>>>>>>>> we
>>> >> >> > >> > > > > > > >> >>>>>>>>>> could support remote user jars.
>>> >> >> > >> > > > > > > >> >>>>>>>>>>>> In the implementation, we assume
>>> users package
>>> >> >> > >> > > > > > > >> >>> flink-clients,
>>> >> >> > >> > > > > > > >> >>>>>>>>>> flink-optimizer, flink-table
>>> together within
>>> >> >> > >> the job
>>> >> >> > >> > > > jar.
>>> >> >> > >> > > > > > > >> >>>>> Otherwise,
>>> >> >> > >> > > > > > > >> >>>>>>>> the
>>> >> >> > >> > > > > > > >> >>>>>>>>>> job graph generation within
>>> >> >> > >> JobClusterEntryPoint will
>>> >> >> > >> > > > > fail.
>>> >> >> > >> > > > > > > >> >>>>>>>>>> 3. What do you mean about the
>>> package? Do users
>>> >> >> > >> need
>>> >> >> > >> > > to
>>> >> >> > >> > > > > > > >> >>> compile
>>> >> >> > >> > > > > > > >> >>>>>> their
>>> >> >> > >> > > > > > > >> >>>>>>>>> jars
>>> >> >> > >> > > > > > > >> >>>>>>>>>> inlcuding flink-clients,
>>> flink-optimizer,
>>> >> >> > >> flink-table
>>> >> >> > >> > > > > > codes?
>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>>> Best,
>>> >> >> > >> > > > > > > >> >>>>>>>>>> Yang
>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>>> Peter Huang <
>>> [hidden email]>
>>> >> >> > >> > > > 于2019年12月10日周二
>>> >> >> > >> > > > > > > >> >>>>> 上午2:37写道:
>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>>>> Dear All,
>>> >> >> > >> > > > > > > >> >>>>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>>>> Recently, the Flink community
>>> starts to
>>> >> >> > >> improve the
>>> >> >> > >> > > > yarn
>>> >> >> > >> > > > > > > >> >>>>> cluster
>>> >> >> > >> > > > > > > >> >>>>>>>>>> descriptor
>>> >> >> > >> > > > > > > >> >>>>>>>>>>> to make job jar and config files
>>> configurable
>>> >> >> > >> from
>>> >> >> > >> > > > CLI.
>>> >> >> > >> > > > > It
>>> >> >> > >> > > > > > > >> >>>>>> improves
>>> >> >> > >> > > > > > > >> >>>>>>>> the
>>> >> >> > >> > > > > > > >> >>>>>>>>>>> flexibility of  Flink deployment
>>> Yarn Per Job
>>> >> >> > >> Mode.
>>> >> >> > >> > > > For
>>> >> >> > >> > > > > > > >> >>>>> platform
>>> >> >> > >> > > > > > > >> >>>>>>>> users
>>> >> >> > >> > > > > > > >> >>>>>>>>>> who
>>> >> >> > >> > > > > > > >> >>>>>>>>>>> manage tens of hundreds of
>>> streaming pipelines
>>> >> >> > >> for
>>> >> >> > >> > > the
>>> >> >> > >> > > > > > whole
>>> >> >> > >> > > > > > > >> >>>>> org
>>> >> >> > >> > > > > > > >> >>>>>> or
>>> >> >> > >> > > > > > > >> >>>>>>>>>>> company, we found the job graph
>>> generation in
>>> >> >> > >> > > > > client-side
>>> >> >> > >> > > > > > is
>>> >> >> > >> > > > > > > >> >>>>>> another
>>> >> >> > >> > > > > > > >> >>>>>>>>>>> pinpoint. Thus, we want to propose
>>> a
>>> >> >> > >> configurable
>>> >> >> > >> > > > > feature
>>> >> >> > >> > > > > > > >> >>> for
>>> >> >> > >> > > > > > > >> >>>>>>>>>>> FlinkYarnSessionCli. The feature
>>> can allow
>>> >> >> > >> users to
>>> >> >> > >> > > > > choose
>>> >> >> > >> > > > > > > >> >>> the
>>> >> >> > >> > > > > > > >> >>>>> job
>>> >> >> > >> > > > > > > >> >>>>>>>>> graph
>>> >> >> > >> > > > > > > >> >>>>>>>>>>> generation in Flink
>>> ClusterEntryPoint so that
>>> >> >> > >> the
>>> >> >> > >> > > job
>>> >> >> > >> > > > > jar
>>> >> >> > >> > > > > > > >> >>>>> doesn't
>>> >> >> > >> > > > > > > >> >>>>>>>> need
>>> >> >> > >> > > > > > > >> >>>>>>>>> to
>>> >> >> > >> > > > > > > >> >>>>>>>>>>> be locally for the job graph
>>> generation. The
>>> >> >> > >> > > proposal
>>> >> >> > >> > > > is
>>> >> >> > >> > > > > > > >> >>>>> organized
>>> >> >> > >> > > > > > > >> >>>>>>>> as a
>>> >> >> > >> > > > > > > >> >>>>>>>>>>> FLIP
>>> >> >> > >> > > > > > > >> >>>>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>
>>> >> >> > >> > > > > > > >> >>>>>
>>> >> >> > >> > > > > > > >> >>>
>>> >> >> > >> > > > > > > >>
>>> >> >> > >> > > > > > >
>>> >> >> > >> > > > > >
>>> >> >> > >> > > > >
>>> >> >> > >> > > >
>>> >> >> > >> > >
>>> >> >> > >>
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-85+Delayed+JobGraph+Generation
>>> >> >> > >> > > > > > > >> >>>>>>>>>>> .
>>> >> >> > >> > > > > > > >> >>>>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>>>> Any questions and suggestions are
>>> welcomed.
>>> >> >> > >> Thank
>>> >> >> > >> > > you
>>> >> >> > >> > > > in
>>> >> >> > >> > > > > > > >> >>>>> advance.
>>> >> >> > >> > > > > > > >> >>>>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>>>> Best Regards
>>> >> >> > >> > > > > > > >> >>>>>>>>>>> Peter Huang
>>> >> >> > >> > > > > > > >> >>>>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>>
>>> >> >> > >> > > > > > > >> >>>>>>
>>> >> >> > >> > > > > > > >> >>>>>
>>> >> >> > >> > > > > > > >> >>>>
>>> >> >> > >> > > > > > > >> >>>
>>> >> >> > >> > > > > > > >> >>
>>> >> >> > >> > > > > > > >>
>>> >> >> > >> > > > > > > >>
>>> >> >> > >> > > > > > >
>>> >> >> > >> > > > > >
>>> >> >> > >> > > > >
>>> >> >> > >> > > >
>>> >> >> > >> > >
>>> >> >> > >>
>>> >> >> > >
>>> >> >>
>>>
>>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-85: Delayed Job Graph Generation

tison
+1 to star voting.

Best,
tison.


Yang Wang <[hidden email]> 于2020年3月5日周四 下午2:29写道:

> Hi Peter,
> Really thanks for your response.
>
> Hi all @Kostas Kloudas <[hidden email]> @Zili Chen
> <[hidden email]> @Peter Huang <[hidden email]> @Rong
> Rong <[hidden email]>
> It seems that we have reached an agreement. The “application mode”
> is regarded as the enhanced “per-job”. It is
> orthogonal with “cluster deploy”. Currently, we bind the “per-job” to
> `run-user-main-on-client` and “application mode”
> to `run-user-main-on-cluster`.
>
> Do you have other concerns to moving FLIP-85 to voting?
>
>
> Best,
> Yang
>
> Peter Huang <[hidden email]> 于2020年3月5日周四 下午12:48写道:
>
>> Hi Yang and Kostas,
>>
>> Thanks for the clarification. It makes more sense to me if the long term
>> goal is to replace per job mode to application mode
>>  in the future (at the time that multiple execute can be supported).
>> Before that, It will be better to keep the concept of
>>  application mode internally. As Yang suggested, User only need to use a
>> `-R/-- remote-deploy` cli option to launch
>> a per job cluster with the main function executed in cluster
>> entry-point.  +1 for the execution plan.
>>
>>
>>
>> Best Regards
>> Peter Huang
>>
>>
>>
>>
>> On Tue, Mar 3, 2020 at 7:11 AM Yang Wang <[hidden email]> wrote:
>>
>>> Hi Peter,
>>>
>>> Having the application mode does not mean we will drop the cluster-deploy
>>> option. I just want to share some thoughts about “Application Mode”.
>>>
>>>
>>> 1. The application mode could cover the per-job sematic. Its lifecyle is
>>> bound
>>> to the user `main()`. And all the jobs in the user main will be executed
>>> in a same
>>> Flink cluster. In first phase of FLIP-85 implementation, running user
>>> main on the
>>> cluster side could be supported in application mode.
>>>
>>> 2. Maybe in the future, we also need to support multiple `execute()` on
>>> client side
>>> in a same Flink cluster. Then the per-job mode will evolve to
>>> application mode.
>>>
>>> 3. From user perspective, only a `-R/-- remote-deploy` cli option is
>>> visible. They
>>> are not aware of the application mode.
>>>
>>> 4. In the first phase, the application mode is working as “per-job”(only
>>> one job in
>>> the user main). We just leave more potential for the future.
>>>
>>>
>>> I am not against with calling it “cluster deploy mode” if you all think
>>> it is clearer for users.
>>>
>>>
>>>
>>> Best,
>>> Yang
>>>
>>> Kostas Kloudas <[hidden email]> 于2020年3月3日周二 下午6:49写道:
>>>
>>>> Hi Peter,
>>>>
>>>> I understand your point. This is why I was also a bit torn about the
>>>> name and my proposal was a bit aligned with yours (something along the
>>>> lines of "cluster deploy" mode).
>>>>
>>>> But many of the other participants in the discussion suggested the
>>>> "Application Mode". I think that the reasoning is that now the user's
>>>> Application is more self-contained.
>>>> It will be submitted to the cluster and the user can just disconnect.
>>>> In addition, as discussed briefly in the doc, in the future there may
>>>> be better support for multi-execute applications which will bring us
>>>> one step closer to the true "Application Mode". But this is how I
>>>> interpreted their arguments, of course they can also express their
>>>> thoughts on the topic :)
>>>>
>>>> Cheers,
>>>> Kostas
>>>>
>>>> On Mon, Mar 2, 2020 at 6:15 PM Peter Huang <[hidden email]>
>>>> wrote:
>>>> >
>>>> > Hi Kostas,
>>>> >
>>>> > Thanks for updating the wiki. We have aligned with the
>>>> implementations in the doc. But I feel it is still a little bit confusing
>>>> of the naming from a user's perspective. It is well known that Flink
>>>> support per job cluster and session cluster. The concept is in the layer of
>>>> how a job is managed within Flink. The method introduced util now is a kind
>>>> of mixing job and session cluster to promising the implementation
>>>> complexity. We probably don't need to label it as Application Model as the
>>>> same layer of per job cluster and session cluster. Conceptually, I think it
>>>> is still a cluster mode implementation for per job cluster.
>>>> >
>>>> > To minimize the confusion of users, I think it would be better just
>>>> an option of per job cluster for each type of cluster manager. How do you
>>>> think?
>>>> >
>>>> >
>>>> > Best Regards
>>>> > Peter Huang
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > On Mon, Mar 2, 2020 at 7:22 AM Kostas Kloudas <[hidden email]>
>>>> wrote:
>>>> >>
>>>> >> Hi Yang,
>>>> >>
>>>> >> The difference between per-job and application mode is that, as you
>>>> >> described, in the per-job mode the main is executed on the client
>>>> >> while in the application mode, the main is executed on the cluster.
>>>> >> I do not think we have to offer "application mode" with running the
>>>> >> main on the client side as this is exactly what the per-job mode does
>>>> >> currently and, as you described also, it would be redundant.
>>>> >>
>>>> >> Sorry if this was not clear in the document.
>>>> >>
>>>> >> Cheers,
>>>> >> Kostas
>>>> >>
>>>> >> On Mon, Mar 2, 2020 at 3:17 PM Yang Wang <[hidden email]>
>>>> wrote:
>>>> >> >
>>>> >> > Hi Kostas,
>>>> >> >
>>>> >> > Thanks a lot for your conclusion and updating the FLIP-85 WIKI.
>>>> Currently, i have no more
>>>> >> > questions about motivation, approach, fault tolerance and the
>>>> first phase implementation.
>>>> >> >
>>>> >> > I think the new title "Flink Application Mode" makes a lot senses
>>>> to me. Especially for the
>>>> >> > containerized environment, the cluster deploy option will be very
>>>> useful.
>>>> >> >
>>>> >> > Just one concern, how do we introduce this new application mode to
>>>> our users?
>>>> >> > Each user program(i.e. `main()`) is an application. Currently, we
>>>> intend to only support one
>>>> >> > `execute()`. So what's the difference between per-job and
>>>> application mode?
>>>> >> >
>>>> >> > For per-job, user `main()` is always executed on client side. And
>>>> For application mode, user
>>>> >> > `main()` could be executed on client or master side(configured via
>>>> cli option).
>>>> >> > Right? We need to have a clear concept. Otherwise, the users will
>>>> be more and more confusing.
>>>> >> >
>>>> >> >
>>>> >> > Best,
>>>> >> > Yang
>>>> >> >
>>>> >> > Kostas Kloudas <[hidden email]> 于2020年3月2日周一 下午5:58写道:
>>>> >> >>
>>>> >> >> Hi all,
>>>> >> >>
>>>> >> >> I update
>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-85+Flink+Application+Mode
>>>> >> >> based on the discussion we had here:
>>>> >> >>
>>>> >> >>
>>>> https://docs.google.com/document/d/1ji72s3FD9DYUyGuKnJoO4ApzV-nSsZa0-bceGXW7Ocw/edit#
>>>> >> >>
>>>> >> >> Please let me know what you think and please keep the discussion
>>>> in the ML :)
>>>> >> >>
>>>> >> >> Thanks for starting the discussion and I hope that soon we will be
>>>> >> >> able to vote on the FLIP.
>>>> >> >>
>>>> >> >> Cheers,
>>>> >> >> Kostas
>>>> >> >>
>>>> >> >> On Thu, Jan 16, 2020 at 3:40 AM Yang Wang <[hidden email]>
>>>> wrote:
>>>> >> >> >
>>>> >> >> > Hi all,
>>>> >> >> >
>>>> >> >> > Thanks a lot for the feedback from @Kostas Kloudas. Your all
>>>> concerns are
>>>> >> >> > on point. The FLIP-85 is mainly
>>>> >> >> > focused on supporting cluster mode for per-job. Since it is
>>>> more urgent and
>>>> >> >> > have much more use
>>>> >> >> > cases both in Yarn and Kubernetes deployment. For session
>>>> cluster, we could
>>>> >> >> > have more discussion
>>>> >> >> > in a new thread later.
>>>> >> >> >
>>>> >> >> > #1, How to download the user jars and dependencies for per-job
>>>> in cluster
>>>> >> >> > mode?
>>>> >> >> > For Yarn, we could register the user jars and dependencies as
>>>> >> >> > LocalResource. They will be distributed
>>>> >> >> > by Yarn. And once the JobManager and TaskManager launched, the
>>>> jars are
>>>> >> >> > already exists.
>>>> >> >> > For Standalone per-job and K8s, we expect that the user jars
>>>> >> >> > and dependencies are built into the image.
>>>> >> >> > Or the InitContainer could be used for downloading. It is
>>>> natively
>>>> >> >> > distributed and we will not have bottleneck.
>>>> >> >> >
>>>> >> >> > #2, Job graph recovery
>>>> >> >> > We could have an optimization to store job graph on the DFS.
>>>> However, i
>>>> >> >> > suggest building a new jobgraph
>>>> >> >> > from the configuration is the default option. Since we will not
>>>> always have
>>>> >> >> > a DFS store when deploying a
>>>> >> >> > Flink per-job cluster. Of course, we assume that using the same
>>>> >> >> > configuration(e.g. job_id, user_jar, main_class,
>>>> >> >> > main_args, parallelism, savepoint_settings, etc.) will get a
>>>> same job
>>>> >> >> > graph. I think the standalone per-job
>>>> >> >> > already has the similar behavior.
>>>> >> >> >
>>>> >> >> > #3, What happens with jobs that have multiple execute calls?
>>>> >> >> > Currently, it is really a problem. Even we use a local client
>>>> on Flink
>>>> >> >> > master side, it will have different behavior with
>>>> >> >> > client mode. For client mode, if we execute multiple times,
>>>> then we will
>>>> >> >> > deploy multiple Flink clusters for each execute.
>>>> >> >> > I am not pretty sure whether it is reasonable. However, i still
>>>> think using
>>>> >> >> > the local client is a good choice. We could
>>>> >> >> > continue the discussion in a new thread. @Zili Chen <
>>>> [hidden email]> Do
>>>> >> >> > you want to drive this?
>>>> >> >> >
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > Best,
>>>> >> >> > Yang
>>>> >> >> >
>>>> >> >> > Peter Huang <[hidden email]> 于2020年1月16日周四
>>>> 上午1:55写道:
>>>> >> >> >
>>>> >> >> > > Hi Kostas,
>>>> >> >> > >
>>>> >> >> > > Thanks for this feedback. I can't agree more about the
>>>> opinion. The
>>>> >> >> > > cluster mode should be added
>>>> >> >> > > first in per job cluster.
>>>> >> >> > >
>>>> >> >> > > 1) For job cluster implementation
>>>> >> >> > > 1. Job graph recovery from configuration or store as static
>>>> job graph as
>>>> >> >> > > session cluster. I think the static one will be better for
>>>> less recovery
>>>> >> >> > > time.
>>>> >> >> > > Let me update the doc for details.
>>>> >> >> > >
>>>> >> >> > > 2. For job execute multiple times, I think @Zili Chen
>>>> >> >> > > <[hidden email]> has proposed the local client
>>>> solution that can
>>>> >> >> > > the run program actually in the cluster entry point. We can
>>>> put the
>>>> >> >> > > implementation in the second stage,
>>>> >> >> > > or even a new FLIP for further discussion.
>>>> >> >> > >
>>>> >> >> > > 2) For session cluster implementation
>>>> >> >> > > We can disable the cluster mode for the session cluster in
>>>> the first
>>>> >> >> > > stage. I agree the jar downloading will be a painful thing.
>>>> >> >> > > We can consider about PoC and performance evaluation first.
>>>> If the end to
>>>> >> >> > > end experience is good enough, then we can consider
>>>> >> >> > > proceeding with the solution.
>>>> >> >> > >
>>>> >> >> > > Looking forward to more opinions from @Yang Wang <
>>>> [hidden email]> @Zili
>>>> >> >> > > Chen <[hidden email]> @Dian Fu <[hidden email]>.
>>>> >> >> > >
>>>> >> >> > >
>>>> >> >> > > Best Regards
>>>> >> >> > > Peter Huang
>>>> >> >> > >
>>>> >> >> > > On Wed, Jan 15, 2020 at 7:50 AM Kostas Kloudas <
>>>> [hidden email]> wrote:
>>>> >> >> > >
>>>> >> >> > >> Hi all,
>>>> >> >> > >>
>>>> >> >> > >> I am writing here as the discussion on the Google Doc seems
>>>> to be a
>>>> >> >> > >> bit difficult to follow.
>>>> >> >> > >>
>>>> >> >> > >> I think that in order to be able to make progress, it would
>>>> be helpful
>>>> >> >> > >> to focus on per-job mode for now.
>>>> >> >> > >> The reason is that:
>>>> >> >> > >>  1) making the (unique) JobSubmitHandler responsible for
>>>> creating the
>>>> >> >> > >> jobgraphs,
>>>> >> >> > >>   which includes downloading dependencies, is not an optimal
>>>> solution
>>>> >> >> > >>  2) even if we put the responsibility on the JobMaster,
>>>> currently each
>>>> >> >> > >> job has its own
>>>> >> >> > >>   JobMaster but they all run on the same process, so we have
>>>> again a
>>>> >> >> > >> single entity.
>>>> >> >> > >>
>>>> >> >> > >> Of course after this is done, and if we feel comfortable
>>>> with the
>>>> >> >> > >> solution, then we can go to the session mode.
>>>> >> >> > >>
>>>> >> >> > >> A second comment has to do with fault-tolerance in the
>>>> per-job,
>>>> >> >> > >> cluster-deploy mode.
>>>> >> >> > >> In the document, it is suggested that upon recovery, the
>>>> JobMaster of
>>>> >> >> > >> each job re-creates the JobGraph.
>>>> >> >> > >> I am just wondering if it is better to create and store the
>>>> jobGraph
>>>> >> >> > >> upon submission and only fetch it
>>>> >> >> > >> upon recovery so that we have a static jobGraph.
>>>> >> >> > >>
>>>> >> >> > >> Finally, I have a question which is what happens with jobs
>>>> that have
>>>> >> >> > >> multiple execute calls?
>>>> >> >> > >> The semantics seem to change compared to the current
>>>> behaviour, right?
>>>> >> >> > >>
>>>> >> >> > >> Cheers,
>>>> >> >> > >> Kostas
>>>> >> >> > >>
>>>> >> >> > >> On Wed, Jan 8, 2020 at 8:05 PM tison <[hidden email]>
>>>> wrote:
>>>> >> >> > >> >
>>>> >> >> > >> > not always, Yang Wang is also not yet a committer but he
>>>> can join the
>>>> >> >> > >> > channel. I cannot find the id by clicking “Add new member
>>>> in channel” so
>>>> >> >> > >> > come to you and ask for try out the link. Possibly I will
>>>> find other
>>>> >> >> > >> ways
>>>> >> >> > >> > but the original purpose is that the slack channel is a
>>>> public area we
>>>> >> >> > >> > discuss about developing...
>>>> >> >> > >> > Best,
>>>> >> >> > >> > tison.
>>>> >> >> > >> >
>>>> >> >> > >> >
>>>> >> >> > >> > Peter Huang <[hidden email]> 于2020年1月9日周四
>>>> 上午2:44写道:
>>>> >> >> > >> >
>>>> >> >> > >> > > Hi Tison,
>>>> >> >> > >> > >
>>>> >> >> > >> > > I am not the committer of Flink yet. I think I can't
>>>> join it also.
>>>> >> >> > >> > >
>>>> >> >> > >> > >
>>>> >> >> > >> > > Best Regards
>>>> >> >> > >> > > Peter Huang
>>>> >> >> > >> > >
>>>> >> >> > >> > > On Wed, Jan 8, 2020 at 9:39 AM tison <
>>>> [hidden email]> wrote:
>>>> >> >> > >> > >
>>>> >> >> > >> > > > Hi Peter,
>>>> >> >> > >> > > >
>>>> >> >> > >> > > > Could you try out this link?
>>>> >> >> > >> > > https://the-asf.slack.com/messages/CNA3ADZPH
>>>> >> >> > >> > > >
>>>> >> >> > >> > > > Best,
>>>> >> >> > >> > > > tison.
>>>> >> >> > >> > > >
>>>> >> >> > >> > > >
>>>> >> >> > >> > > > Peter Huang <[hidden email]> 于2020年1月9日周四
>>>> 上午1:22写道:
>>>> >> >> > >> > > >
>>>> >> >> > >> > > > > Hi Tison,
>>>> >> >> > >> > > > >
>>>> >> >> > >> > > > > I can't join the group with shared link. Would you
>>>> please add me
>>>> >> >> > >> into
>>>> >> >> > >> > > the
>>>> >> >> > >> > > > > group? My slack account is huangzhenqiu0825.
>>>> >> >> > >> > > > > Thank you in advance.
>>>> >> >> > >> > > > >
>>>> >> >> > >> > > > >
>>>> >> >> > >> > > > > Best Regards
>>>> >> >> > >> > > > > Peter Huang
>>>> >> >> > >> > > > >
>>>> >> >> > >> > > > > On Wed, Jan 8, 2020 at 12:02 AM tison <
>>>> [hidden email]>
>>>> >> >> > >> wrote:
>>>> >> >> > >> > > > >
>>>> >> >> > >> > > > > > Hi Peter,
>>>> >> >> > >> > > > > >
>>>> >> >> > >> > > > > > As described above, this effort should get
>>>> attention from people
>>>> >> >> > >> > > > > developing
>>>> >> >> > >> > > > > > FLIP-73 a.k.a. Executor abstractions. I recommend
>>>> you to join
>>>> >> >> > >> the
>>>> >> >> > >> > > > public
>>>> >> >> > >> > > > > > slack channel[1] for Flink Client API Enhancement
>>>> and you can
>>>> >> >> > >> try to
>>>> >> >> > >> > > > > share
>>>> >> >> > >> > > > > > you detailed thoughts there. It possibly gets more
>>>> concrete
>>>> >> >> > >> > > attentions.
>>>> >> >> > >> > > > > >
>>>> >> >> > >> > > > > > Best,
>>>> >> >> > >> > > > > > tison.
>>>> >> >> > >> > > > > >
>>>> >> >> > >> > > > > > [1]
>>>> >> >> > >> > > > > >
>>>> >> >> > >> > > > > >
>>>> >> >> > >> > > > >
>>>> >> >> > >> > > >
>>>> >> >> > >> > >
>>>> >> >> > >>
>>>> https://slack.com/share/IS21SJ75H/Rk8HhUly9FuEHb7oGwBZ33uL/enQtODg2MDYwNjE5MTg3LTA2MjIzNDc1M2ZjZDVlMjdlZjk1M2RkYmJhNjAwMTk2ZDZkODQ4NmY5YmI4OGRhNWJkYTViMTM1NzlmMzc4OWM
>>>> >> >> > >> > > > > >
>>>> >> >> > >> > > > > >
>>>> >> >> > >> > > > > > Peter Huang <[hidden email]>
>>>> 于2020年1月7日周二 上午5:09写道:
>>>> >> >> > >> > > > > >
>>>> >> >> > >> > > > > > > Dear All,
>>>> >> >> > >> > > > > > >
>>>> >> >> > >> > > > > > > Happy new year! According to existing feedback
>>>> from the
>>>> >> >> > >> community,
>>>> >> >> > >> > > we
>>>> >> >> > >> > > > > > > revised the doc with the consideration of
>>>> session cluster
>>>> >> >> > >> support,
>>>> >> >> > >> > > > and
>>>> >> >> > >> > > > > > > concrete interface changes needed and execution
>>>> plan. Please
>>>> >> >> > >> take
>>>> >> >> > >> > > one
>>>> >> >> > >> > > > > > more
>>>> >> >> > >> > > > > > > round of review at your most convenient time.
>>>> >> >> > >> > > > > > >
>>>> >> >> > >> > > > > > >
>>>> >> >> > >> > > > > > >
>>>> >> >> > >> > > > > >
>>>> >> >> > >> > > > >
>>>> >> >> > >> > > >
>>>> >> >> > >> > >
>>>> >> >> > >>
>>>> https://docs.google.com/document/d/1aAwVjdZByA-0CHbgv16Me-vjaaDMCfhX7TzVVTuifYM/edit#
>>>> >> >> > >> > > > > > >
>>>> >> >> > >> > > > > > >
>>>> >> >> > >> > > > > > > Best Regards
>>>> >> >> > >> > > > > > > Peter Huang
>>>> >> >> > >> > > > > > >
>>>> >> >> > >> > > > > > >
>>>> >> >> > >> > > > > > >
>>>> >> >> > >> > > > > > >
>>>> >> >> > >> > > > > > >
>>>> >> >> > >> > > > > > > On Thu, Jan 2, 2020 at 11:29 AM Peter Huang <
>>>> >> >> > >> > > > > [hidden email]>
>>>> >> >> > >> > > > > > > wrote:
>>>> >> >> > >> > > > > > >
>>>> >> >> > >> > > > > > > > Hi Dian,
>>>> >> >> > >> > > > > > > > Thanks for giving us valuable feedbacks.
>>>> >> >> > >> > > > > > > >
>>>> >> >> > >> > > > > > > > 1) It's better to have a whole design for this
>>>> feature
>>>> >> >> > >> > > > > > > > For the suggestion of enabling the cluster
>>>> mode also session
>>>> >> >> > >> > > > > cluster, I
>>>> >> >> > >> > > > > > > > think Flink already supported it.
>>>> WebSubmissionExtension
>>>> >> >> > >> already
>>>> >> >> > >> > > > > allows
>>>> >> >> > >> > > > > > > > users to start a job with the specified jar by
>>>> using web UI.
>>>> >> >> > >> > > > > > > > But we need to enable the feature from CLI for
>>>> both local
>>>> >> >> > >> jar,
>>>> >> >> > >> > > > remote
>>>> >> >> > >> > > > > > > jar.
>>>> >> >> > >> > > > > > > > I will align with Yang Wang first about the
>>>> details and
>>>> >> >> > >> update
>>>> >> >> > >> > > the
>>>> >> >> > >> > > > > > design
>>>> >> >> > >> > > > > > > > doc.
>>>> >> >> > >> > > > > > > >
>>>> >> >> > >> > > > > > > > 2) It's better to consider the convenience for
>>>> users, such
>>>> >> >> > >> as
>>>> >> >> > >> > > > > debugging
>>>> >> >> > >> > > > > > > >
>>>> >> >> > >> > > > > > > > I am wondering whether we can store the
>>>> exception in
>>>> >> >> > >> jobgragh
>>>> >> >> > >> > > > > > > > generation in application master. As no
>>>> streaming graph can
>>>> >> >> > >> be
>>>> >> >> > >> > > > > > scheduled
>>>> >> >> > >> > > > > > > in
>>>> >> >> > >> > > > > > > > this case, there will be no more TM will be
>>>> requested from
>>>> >> >> > >> > > FlinkRM.
>>>> >> >> > >> > > > > > > > If the AM is still running, users can still
>>>> query it from
>>>> >> >> > >> CLI. As
>>>> >> >> > >> > > > it
>>>> >> >> > >> > > > > > > > requires more change, we can get some feedback
>>>> from <
>>>> >> >> > >> > > > > > [hidden email]
>>>> >> >> > >> > > > > > > >
>>>> >> >> > >> > > > > > > > and @[hidden email] <[hidden email]>.
>>>> >> >> > >> > > > > > > >
>>>> >> >> > >> > > > > > > > 3) It's better to consider the impact to the
>>>> stability of
>>>> >> >> > >> the
>>>> >> >> > >> > > > cluster
>>>> >> >> > >> > > > > > > >
>>>> >> >> > >> > > > > > > > I agree with Yang Wang's opinion.
>>>> >> >> > >> > > > > > > >
>>>> >> >> > >> > > > > > > >
>>>> >> >> > >> > > > > > > >
>>>> >> >> > >> > > > > > > > Best Regards
>>>> >> >> > >> > > > > > > > Peter Huang
>>>> >> >> > >> > > > > > > >
>>>> >> >> > >> > > > > > > >
>>>> >> >> > >> > > > > > > > On Sun, Dec 29, 2019 at 9:44 PM Dian Fu <
>>>> >> >> > >> [hidden email]>
>>>> >> >> > >> > > > > wrote:
>>>> >> >> > >> > > > > > > >
>>>> >> >> > >> > > > > > > >> Hi all,
>>>> >> >> > >> > > > > > > >>
>>>> >> >> > >> > > > > > > >> Sorry to jump into this discussion. Thanks
>>>> everyone for the
>>>> >> >> > >> > > > > > discussion.
>>>> >> >> > >> > > > > > > >> I'm very interested in this topic although
>>>> I'm not an
>>>> >> >> > >> expert in
>>>> >> >> > >> > > > this
>>>> >> >> > >> > > > > > > part.
>>>> >> >> > >> > > > > > > >> So I'm glad to share my thoughts as following:
>>>> >> >> > >> > > > > > > >>
>>>> >> >> > >> > > > > > > >> 1) It's better to have a whole design for
>>>> this feature
>>>> >> >> > >> > > > > > > >> As we know, there are two deployment modes:
>>>> per-job mode
>>>> >> >> > >> and
>>>> >> >> > >> > > > session
>>>> >> >> > >> > > > > > > >> mode. I'm wondering which mode really needs
>>>> this feature.
>>>> >> >> > >> As the
>>>> >> >> > >> > > > > > design
>>>> >> >> > >> > > > > > > doc
>>>> >> >> > >> > > > > > > >> mentioned, per-job mode is more used for
>>>> streaming jobs and
>>>> >> >> > >> > > > session
>>>> >> >> > >> > > > > > > mode is
>>>> >> >> > >> > > > > > > >> usually used for batch jobs(Of course, the
>>>> job types and
>>>> >> >> > >> the
>>>> >> >> > >> > > > > > deployment
>>>> >> >> > >> > > > > > > >> modes are orthogonal). Usually streaming job
>>>> is only
>>>> >> >> > >> needed to
>>>> >> >> > >> > > be
>>>> >> >> > >> > > > > > > submitted
>>>> >> >> > >> > > > > > > >> once and it will run for days or weeks, while
>>>> batch jobs
>>>> >> >> > >> will be
>>>> >> >> > >> > > > > > > submitted
>>>> >> >> > >> > > > > > > >> more frequently compared with streaming jobs.
>>>> This means
>>>> >> >> > >> that
>>>> >> >> > >> > > > maybe
>>>> >> >> > >> > > > > > > session
>>>> >> >> > >> > > > > > > >> mode also needs this feature. However, if we
>>>> support this
>>>> >> >> > >> > > feature
>>>> >> >> > >> > > > in
>>>> >> >> > >> > > > > > > >> session mode, the application master will
>>>> become the new
>>>> >> >> > >> > > > centralized
>>>> >> >> > >> > > > > > > >> service(which should be solved). So in this
>>>> case, it's
>>>> >> >> > >> better to
>>>> >> >> > >> > > > > have
>>>> >> >> > >> > > > > > a
>>>> >> >> > >> > > > > > > >> complete design for both per-job mode and
>>>> session mode.
>>>> >> >> > >> > > > Furthermore,
>>>> >> >> > >> > > > > > > even
>>>> >> >> > >> > > > > > > >> if we can do it phase by phase, we need to
>>>> have a whole
>>>> >> >> > >> picture
>>>> >> >> > >> > > of
>>>> >> >> > >> > > > > how
>>>> >> >> > >> > > > > > > it
>>>> >> >> > >> > > > > > > >> works in both per-job mode and session mode.
>>>> >> >> > >> > > > > > > >>
>>>> >> >> > >> > > > > > > >> 2) It's better to consider the convenience
>>>> for users, such
>>>> >> >> > >> as
>>>> >> >> > >> > > > > > debugging
>>>> >> >> > >> > > > > > > >> After we finish this feature, the job graph
>>>> will be
>>>> >> >> > >> compiled in
>>>> >> >> > >> > > > the
>>>> >> >> > >> > > > > > > >> application master, which means that users
>>>> cannot easily
>>>> >> >> > >> get the
>>>> >> >> > >> > > > > > > exception
>>>> >> >> > >> > > > > > > >> message synchorousely in the job client if
>>>> there are
>>>> >> >> > >> problems
>>>> >> >> > >> > > > during
>>>> >> >> > >> > > > > > the
>>>> >> >> > >> > > > > > > >> job graph compiling (especially for platform
>>>> users), such
>>>> >> >> > >> as the
>>>> >> >> > >> > > > > > > resource
>>>> >> >> > >> > > > > > > >> path is incorrect, the user program itself
>>>> has some
>>>> >> >> > >> problems,
>>>> >> >> > >> > > etc.
>>>> >> >> > >> > > > > > What
>>>> >> >> > >> > > > > > > I'm
>>>> >> >> > >> > > > > > > >> thinking is that maybe we should throw the
>>>> exceptions as
>>>> >> >> > >> early
>>>> >> >> > >> > > as
>>>> >> >> > >> > > > > > > possible
>>>> >> >> > >> > > > > > > >> (during job submission stage).
>>>> >> >> > >> > > > > > > >>
>>>> >> >> > >> > > > > > > >> 3) It's better to consider the impact to the
>>>> stability of
>>>> >> >> > >> the
>>>> >> >> > >> > > > > cluster
>>>> >> >> > >> > > > > > > >> If we perform the compiling in the
>>>> application master, we
>>>> >> >> > >> should
>>>> >> >> > >> > > > > > > consider
>>>> >> >> > >> > > > > > > >> the impact of the compiling errors. Although
>>>> YARN could
>>>> >> >> > >> resume
>>>> >> >> > >> > > the
>>>> >> >> > >> > > > > > > >> application master in case of failures, but
>>>> in some case
>>>> >> >> > >> the
>>>> >> >> > >> > > > > compiling
>>>> >> >> > >> > > > > > > >> failure may be a waste of cluster resource
>>>> and may impact
>>>> >> >> > >> the
>>>> >> >> > >> > > > > > stability
>>>> >> >> > >> > > > > > > the
>>>> >> >> > >> > > > > > > >> cluster and the other jobs in the cluster,
>>>> such as the
>>>> >> >> > >> resource
>>>> >> >> > >> > > > path
>>>> >> >> > >> > > > > > is
>>>> >> >> > >> > > > > > > >> incorrect, the user program itself has some
>>>> problems(in
>>>> >> >> > >> this
>>>> >> >> > >> > > case,
>>>> >> >> > >> > > > > job
>>>> >> >> > >> > > > > > > >> failover cannot solve this kind of problems)
>>>> etc. In the
>>>> >> >> > >> current
>>>> >> >> > >> > > > > > > >> implemention, the compiling errors are
>>>> handled in the
>>>> >> >> > >> client
>>>> >> >> > >> > > side
>>>> >> >> > >> > > > > and
>>>> >> >> > >> > > > > > > there
>>>> >> >> > >> > > > > > > >> is no impact to the cluster at all.
>>>> >> >> > >> > > > > > > >>
>>>> >> >> > >> > > > > > > >> Regarding to 1), it's clearly pointed in the
>>>> design doc
>>>> >> >> > >> that
>>>> >> >> > >> > > only
>>>> >> >> > >> > > > > > > per-job
>>>> >> >> > >> > > > > > > >> mode will be supported. However, I think it's
>>>> better to
>>>> >> >> > >> also
>>>> >> >> > >> > > > > consider
>>>> >> >> > >> > > > > > > the
>>>> >> >> > >> > > > > > > >> session mode in the design doc.
>>>> >> >> > >> > > > > > > >> Regarding to 2) and 3), I have not seen
>>>> related sections
>>>> >> >> > >> in the
>>>> >> >> > >> > > > > design
>>>> >> >> > >> > > > > > > >> doc. It will be good if we can cover them in
>>>> the design
>>>> >> >> > >> doc.
>>>> >> >> > >> > > > > > > >>
>>>> >> >> > >> > > > > > > >> Feel free to correct me If there is anything I
>>>> >> >> > >> misunderstand.
>>>> >> >> > >> > > > > > > >>
>>>> >> >> > >> > > > > > > >> Regards,
>>>> >> >> > >> > > > > > > >> Dian
>>>> >> >> > >> > > > > > > >>
>>>> >> >> > >> > > > > > > >>
>>>> >> >> > >> > > > > > > >> > 在 2019年12月27日,上午3:13,Peter Huang <
>>>> >> >> > >> [hidden email]>
>>>> >> >> > >> > > > 写道:
>>>> >> >> > >> > > > > > > >> >
>>>> >> >> > >> > > > > > > >> > Hi Yang,
>>>> >> >> > >> > > > > > > >> >
>>>> >> >> > >> > > > > > > >> > I can't agree more. The effort definitely
>>>> needs to align
>>>> >> >> > >> with
>>>> >> >> > >> > > > the
>>>> >> >> > >> > > > > > > final
>>>> >> >> > >> > > > > > > >> > goal of FLIP-73.
>>>> >> >> > >> > > > > > > >> > I am thinking about whether we can achieve
>>>> the goal with
>>>> >> >> > >> two
>>>> >> >> > >> > > > > phases.
>>>> >> >> > >> > > > > > > >> >
>>>> >> >> > >> > > > > > > >> > 1) Phase I
>>>> >> >> > >> > > > > > > >> > As the CLiFrontend will not be depreciated
>>>> soon. We can
>>>> >> >> > >> still
>>>> >> >> > >> > > > use
>>>> >> >> > >> > > > > > the
>>>> >> >> > >> > > > > > > >> > deployMode flag there,
>>>> >> >> > >> > > > > > > >> > pass the program info through Flink
>>>> configuration,  use
>>>> >> >> > >> the
>>>> >> >> > >> > > > > > > >> > ClassPathJobGraphRetriever
>>>> >> >> > >> > > > > > > >> > to generate the job graph in
>>>> ClusterEntrypoints of yarn
>>>> >> >> > >> and
>>>> >> >> > >> > > > > > > Kubernetes.
>>>> >> >> > >> > > > > > > >> >
>>>> >> >> > >> > > > > > > >> > 2) Phase II
>>>> >> >> > >> > > > > > > >> > In  AbstractJobClusterExecutor, the job
>>>> graph is
>>>> >> >> > >> generated in
>>>> >> >> > >> > > > the
>>>> >> >> > >> > > > > > > >> execute
>>>> >> >> > >> > > > > > > >> > function. We can still
>>>> >> >> > >> > > > > > > >> > use the deployMode in it. With deployMode =
>>>> cluster, the
>>>> >> >> > >> > > execute
>>>> >> >> > >> > > > > > > >> function
>>>> >> >> > >> > > > > > > >> > only starts the cluster.
>>>> >> >> > >> > > > > > > >> >
>>>> >> >> > >> > > > > > > >> > When
>>>> {Yarn/Kuberneates}PerJobClusterEntrypoint starts,
>>>> >> >> > >> It will
>>>> >> >> > >> > > > > start
>>>> >> >> > >> > > > > > > the
>>>> >> >> > >> > > > > > > >> > dispatch first, then we can use
>>>> >> >> > >> > > > > > > >> > a ClusterEnvironment similar to
>>>> ContextEnvironment to
>>>> >> >> > >> submit
>>>> >> >> > >> > > the
>>>> >> >> > >> > > > > job
>>>> >> >> > >> > > > > > > >> with
>>>> >> >> > >> > > > > > > >> > jobName the local
>>>> >> >> > >> > > > > > > >> > dispatcher. For the details, we need more
>>>> investigation.
>>>> >> >> > >> Let's
>>>> >> >> > >> > > > > wait
>>>> >> >> > >> > > > > > > >> > for @Aljoscha
>>>> >> >> > >> > > > > > > >> > Krettek <[hidden email]> @Till
>>>> Rohrmann <
>>>> >> >> > >> > > > > [hidden email]
>>>> >> >> > >> > > > > > >'s
>>>> >> >> > >> > > > > > > >> > feedback after the holiday season.
>>>> >> >> > >> > > > > > > >> >
>>>> >> >> > >> > > > > > > >> > Thank you in advance. Merry Chrismas and
>>>> Happy New
>>>> >> >> > >> Year!!!
>>>> >> >> > >> > > > > > > >> >
>>>> >> >> > >> > > > > > > >> >
>>>> >> >> > >> > > > > > > >> > Best Regards
>>>> >> >> > >> > > > > > > >> > Peter Huang
>>>> >> >> > >> > > > > > > >> >
>>>> >> >> > >> > > > > > > >> >
>>>> >> >> > >> > > > > > > >> >
>>>> >> >> > >> > > > > > > >> >
>>>> >> >> > >> > > > > > > >> >
>>>> >> >> > >> > > > > > > >> >
>>>> >> >> > >> > > > > > > >> >
>>>> >> >> > >> > > > > > > >> >
>>>> >> >> > >> > > > > > > >> > On Wed, Dec 25, 2019 at 1:08 AM Yang Wang <
>>>> >> >> > >> > > > [hidden email]>
>>>> >> >> > >> > > > > > > >> wrote:
>>>> >> >> > >> > > > > > > >> >
>>>> >> >> > >> > > > > > > >> >> Hi Peter,
>>>> >> >> > >> > > > > > > >> >>
>>>> >> >> > >> > > > > > > >> >> I think we need to reconsider tison's
>>>> suggestion
>>>> >> >> > >> seriously.
>>>> >> >> > >> > > > After
>>>> >> >> > >> > > > > > > >> FLIP-73,
>>>> >> >> > >> > > > > > > >> >> the deployJobCluster has
>>>> >> >> > >> > > > > > > >> >> beenmoved into
>>>> `JobClusterExecutor#execute`. It should
>>>> >> >> > >> not be
>>>> >> >> > >> > > > > > > perceived
>>>> >> >> > >> > > > > > > >> >> for `CliFrontend`. That
>>>> >> >> > >> > > > > > > >> >> means the user program will *ALWAYS* be
>>>> executed on
>>>> >> >> > >> client
>>>> >> >> > >> > > > side.
>>>> >> >> > >> > > > > > This
>>>> >> >> > >> > > > > > > >> is
>>>> >> >> > >> > > > > > > >> >> the by design behavior.
>>>> >> >> > >> > > > > > > >> >> So, we could not just add `if(client mode)
>>>> .. else
>>>> >> >> > >> if(cluster
>>>> >> >> > >> > > > > mode)
>>>> >> >> > >> > > > > > > >> ...`
>>>> >> >> > >> > > > > > > >> >> codes in `CliFrontend` to bypass
>>>> >> >> > >> > > > > > > >> >> the executor. We need to find a clean way
>>>> to decouple
>>>> >> >> > >> > > executing
>>>> >> >> > >> > > > > > user
>>>> >> >> > >> > > > > > > >> >> program and deploying per-job
>>>> >> >> > >> > > > > > > >> >> cluster. Based on this, we could support
>>>> to execute user
>>>> >> >> > >> > > > program
>>>> >> >> > >> > > > > on
>>>> >> >> > >> > > > > > > >> client
>>>> >> >> > >> > > > > > > >> >> or master side.
>>>> >> >> > >> > > > > > > >> >>
>>>> >> >> > >> > > > > > > >> >> Maybe Aljoscha and Jeff could give some
>>>> good
>>>> >> >> > >> suggestions.
>>>> >> >> > >> > > > > > > >> >>
>>>> >> >> > >> > > > > > > >> >>
>>>> >> >> > >> > > > > > > >> >>
>>>> >> >> > >> > > > > > > >> >> Best,
>>>> >> >> > >> > > > > > > >> >> Yang
>>>> >> >> > >> > > > > > > >> >>
>>>> >> >> > >> > > > > > > >> >> Peter Huang <[hidden email]>
>>>> 于2019年12月25日周三
>>>> >> >> > >> > > > > 上午4:03写道:
>>>> >> >> > >> > > > > > > >> >>
>>>> >> >> > >> > > > > > > >> >>> Hi Jingjing,
>>>> >> >> > >> > > > > > > >> >>>
>>>> >> >> > >> > > > > > > >> >>> The improvement proposed is a deployment
>>>> option for
>>>> >> >> > >> CLI. For
>>>> >> >> > >> > > > SQL
>>>> >> >> > >> > > > > > > based
>>>> >> >> > >> > > > > > > >> >>> Flink application, It is more convenient
>>>> to use the
>>>> >> >> > >> existing
>>>> >> >> > >> > > > > model
>>>> >> >> > >> > > > > > > in
>>>> >> >> > >> > > > > > > >> >>> SqlClient in which
>>>> >> >> > >> > > > > > > >> >>> the job graph is generated within
>>>> SqlClient. After
>>>> >> >> > >> adding
>>>> >> >> > >> > > the
>>>> >> >> > >> > > > > > > delayed
>>>> >> >> > >> > > > > > > >> job
>>>> >> >> > >> > > > > > > >> >>> graph generation, I think there is no
>>>> change is needed
>>>> >> >> > >> for
>>>> >> >> > >> > > > your
>>>> >> >> > >> > > > > > > side.
>>>> >> >> > >> > > > > > > >> >>>
>>>> >> >> > >> > > > > > > >> >>>
>>>> >> >> > >> > > > > > > >> >>> Best Regards
>>>> >> >> > >> > > > > > > >> >>> Peter Huang
>>>> >> >> > >> > > > > > > >> >>>
>>>> >> >> > >> > > > > > > >> >>>
>>>> >> >> > >> > > > > > > >> >>> On Wed, Dec 18, 2019 at 6:01 AM jingjing
>>>> bai <
>>>> >> >> > >> > > > > > > >> [hidden email]>
>>>> >> >> > >> > > > > > > >> >>> wrote:
>>>> >> >> > >> > > > > > > >> >>>
>>>> >> >> > >> > > > > > > >> >>>> hi peter:
>>>> >> >> > >> > > > > > > >> >>>>    we had extension SqlClent to support
>>>> sql job
>>>> >> >> > >> submit in
>>>> >> >> > >> > > web
>>>> >> >> > >> > > > > > base
>>>> >> >> > >> > > > > > > on
>>>> >> >> > >> > > > > > > >> >>>> flink 1.9.   we support submit to yarn
>>>> on per job
>>>> >> >> > >> mode too.
>>>> >> >> > >> > > > > > > >> >>>>    in this case, the job graph
>>>> generated  on client
>>>> >> >> > >> side
>>>> >> >> > >> > > .  I
>>>> >> >> > >> > > > > > think
>>>> >> >> > >> > > > > > > >> >>> this
>>>> >> >> > >> > > > > > > >> >>>> discuss Mainly to improve api
>>>> programme.  but in my
>>>> >> >> > >> case ,
>>>> >> >> > >> > > > > there
>>>> >> >> > >> > > > > > is
>>>> >> >> > >> > > > > > > >> no
>>>> >> >> > >> > > > > > > >> >>>> jar to upload but only a sql string .
>>>> >> >> > >> > > > > > > >> >>>>    do u had more suggestion to improve
>>>> for sql mode
>>>> >> >> > >> or it
>>>> >> >> > >> > > is
>>>> >> >> > >> > > > > > only a
>>>> >> >> > >> > > > > > > >> >>>> switch for api programme?
>>>> >> >> > >> > > > > > > >> >>>>
>>>> >> >> > >> > > > > > > >> >>>>
>>>> >> >> > >> > > > > > > >> >>>> best
>>>> >> >> > >> > > > > > > >> >>>> bai jj
>>>> >> >> > >> > > > > > > >> >>>>
>>>> >> >> > >> > > > > > > >> >>>>
>>>> >> >> > >> > > > > > > >> >>>> Yang Wang <[hidden email]>
>>>> 于2019年12月18日周三
>>>> >> >> > >> 下午7:21写道:
>>>> >> >> > >> > > > > > > >> >>>>
>>>> >> >> > >> > > > > > > >> >>>>> I just want to revive this discussion.
>>>> >> >> > >> > > > > > > >> >>>>>
>>>> >> >> > >> > > > > > > >> >>>>> Recently, i am thinking about how to
>>>> natively run
>>>> >> >> > >> flink
>>>> >> >> > >> > > > > per-job
>>>> >> >> > >> > > > > > > >> >>> cluster on
>>>> >> >> > >> > > > > > > >> >>>>> Kubernetes.
>>>> >> >> > >> > > > > > > >> >>>>> The per-job mode on Kubernetes is very
>>>> different
>>>> >> >> > >> from on
>>>> >> >> > >> > > > Yarn.
>>>> >> >> > >> > > > > > And
>>>> >> >> > >> > > > > > > >> we
>>>> >> >> > >> > > > > > > >> >>> will
>>>> >> >> > >> > > > > > > >> >>>>> have
>>>> >> >> > >> > > > > > > >> >>>>> the same deployment requirements to the
>>>> client and
>>>> >> >> > >> entry
>>>> >> >> > >> > > > > point.
>>>> >> >> > >> > > > > > > >> >>>>>
>>>> >> >> > >> > > > > > > >> >>>>> 1. Flink client not always need a local
>>>> jar to start
>>>> >> >> > >> a
>>>> >> >> > >> > > Flink
>>>> >> >> > >> > > > > > > per-job
>>>> >> >> > >> > > > > > > >> >>>>> cluster. We could
>>>> >> >> > >> > > > > > > >> >>>>> support multiple schemas. For example,
>>>> >> >> > >> > > > file:///path/of/my.jar
>>>> >> >> > >> > > > > > > means
>>>> >> >> > >> > > > > > > >> a
>>>> >> >> > >> > > > > > > >> >>> jar
>>>> >> >> > >> > > > > > > >> >>>>> located
>>>> >> >> > >> > > > > > > >> >>>>> at client side,
>>>> >> >> > >> hdfs://myhdfs/user/myname/flink/my.jar
>>>> >> >> > >> > > > means a
>>>> >> >> > >> > > > > > jar
>>>> >> >> > >> > > > > > > >> >>> located
>>>> >> >> > >> > > > > > > >> >>>>> at
>>>> >> >> > >> > > > > > > >> >>>>> remote hdfs,
>>>> local:///path/in/image/my.jar means a
>>>> >> >> > >> jar
>>>> >> >> > >> > > > located
>>>> >> >> > >> > > > > > at
>>>> >> >> > >> > > > > > > >> >>>>> jobmanager side.
>>>> >> >> > >> > > > > > > >> >>>>>
>>>> >> >> > >> > > > > > > >> >>>>> 2. Support running user program on
>>>> master side. This
>>>> >> >> > >> also
>>>> >> >> > >> > > > > means
>>>> >> >> > >> > > > > > > the
>>>> >> >> > >> > > > > > > >> >>> entry
>>>> >> >> > >> > > > > > > >> >>>>> point
>>>> >> >> > >> > > > > > > >> >>>>> will generate the job graph on master
>>>> side. We could
>>>> >> >> > >> use
>>>> >> >> > >> > > the
>>>> >> >> > >> > > > > > > >> >>>>> ClasspathJobGraphRetriever
>>>> >> >> > >> > > > > > > >> >>>>> or start a local Flink client to
>>>> achieve this
>>>> >> >> > >> purpose.
>>>> >> >> > >> > > > > > > >> >>>>>
>>>> >> >> > >> > > > > > > >> >>>>>
>>>> >> >> > >> > > > > > > >> >>>>> cc tison, Aljoscha & Kostas Do you
>>>> think this is the
>>>> >> >> > >> right
>>>> >> >> > >> > > > > > > >> direction we
>>>> >> >> > >> > > > > > > >> >>>>> need to work?
>>>> >> >> > >> > > > > > > >> >>>>>
>>>> >> >> > >> > > > > > > >> >>>>> tison <[hidden email]>
>>>> 于2019年12月12日周四
>>>> >> >> > >> 下午4:48写道:
>>>> >> >> > >> > > > > > > >> >>>>>
>>>> >> >> > >> > > > > > > >> >>>>>> A quick idea is that we separate the
>>>> deployment
>>>> >> >> > >> from user
>>>> >> >> > >> > > > > > program
>>>> >> >> > >> > > > > > > >> >>> that
>>>> >> >> > >> > > > > > > >> >>>>> it
>>>> >> >> > >> > > > > > > >> >>>>>> has always been done
>>>> >> >> > >> > > > > > > >> >>>>>> outside the program. On user program
>>>> executed there
>>>> >> >> > >> is
>>>> >> >> > >> > > > > always a
>>>> >> >> > >> > > > > > > >> >>>>>> ClusterClient that communicates with
>>>> >> >> > >> > > > > > > >> >>>>>> an existing cluster, remote or local.
>>>> It will be
>>>> >> >> > >> another
>>>> >> >> > >> > > > > thread
>>>> >> >> > >> > > > > > > so
>>>> >> >> > >> > > > > > > >> >>> just
>>>> >> >> > >> > > > > > > >> >>>>> for
>>>> >> >> > >> > > > > > > >> >>>>>> your information.
>>>> >> >> > >> > > > > > > >> >>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>> Best,
>>>> >> >> > >> > > > > > > >> >>>>>> tison.
>>>> >> >> > >> > > > > > > >> >>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>> tison <[hidden email]>
>>>> 于2019年12月12日周四
>>>> >> >> > >> 下午4:40写道:
>>>> >> >> > >> > > > > > > >> >>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>> Hi Peter,
>>>> >> >> > >> > > > > > > >> >>>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>> Another concern I realized recently
>>>> is that with
>>>> >> >> > >> current
>>>> >> >> > >> > > > > > > Executors
>>>> >> >> > >> > > > > > > >> >>>>>>> abstraction(FLIP-73)
>>>> >> >> > >> > > > > > > >> >>>>>>> I'm afraid that user program is
>>>> designed to ALWAYS
>>>> >> >> > >> run
>>>> >> >> > >> > > on
>>>> >> >> > >> > > > > the
>>>> >> >> > >> > > > > > > >> >>> client
>>>> >> >> > >> > > > > > > >> >>>>>> side.
>>>> >> >> > >> > > > > > > >> >>>>>>> Specifically,
>>>> >> >> > >> > > > > > > >> >>>>>>> we deploy the job in executor when
>>>> env.execute
>>>> >> >> > >> called.
>>>> >> >> > >> > > > This
>>>> >> >> > >> > > > > > > >> >>>>> abstraction
>>>> >> >> > >> > > > > > > >> >>>>>>> possibly prevents
>>>> >> >> > >> > > > > > > >> >>>>>>> Flink runs user program on the
>>>> cluster side.
>>>> >> >> > >> > > > > > > >> >>>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>> For your proposal, in this case we
>>>> already
>>>> >> >> > >> compiled the
>>>> >> >> > >> > > > > > program
>>>> >> >> > >> > > > > > > >> and
>>>> >> >> > >> > > > > > > >> >>>>> run
>>>> >> >> > >> > > > > > > >> >>>>>> on
>>>> >> >> > >> > > > > > > >> >>>>>>> the client side,
>>>> >> >> > >> > > > > > > >> >>>>>>> even we deploy a cluster and retrieve
>>>> job graph
>>>> >> >> > >> from
>>>> >> >> > >> > > > program
>>>> >> >> > >> > > > > > > >> >>>>> metadata, it
>>>> >> >> > >> > > > > > > >> >>>>>>> doesn't make
>>>> >> >> > >> > > > > > > >> >>>>>>> many sense.
>>>> >> >> > >> > > > > > > >> >>>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>> cc Aljoscha & Kostas what do you
>>>> think about this
>>>> >> >> > >> > > > > constraint?
>>>> >> >> > >> > > > > > > >> >>>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>> Best,
>>>> >> >> > >> > > > > > > >> >>>>>>> tison.
>>>> >> >> > >> > > > > > > >> >>>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>> Peter Huang <
>>>> [hidden email]>
>>>> >> >> > >> 于2019年12月10日周二
>>>> >> >> > >> > > > > > > >> 下午12:45写道:
>>>> >> >> > >> > > > > > > >> >>>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>>> Hi Tison,
>>>> >> >> > >> > > > > > > >> >>>>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>>> Yes, you are right. I think I made
>>>> the wrong
>>>> >> >> > >> argument
>>>> >> >> > >> > > in
>>>> >> >> > >> > > > > the
>>>> >> >> > >> > > > > > > doc.
>>>> >> >> > >> > > > > > > >> >>>>>>>> Basically, the packaging jar problem
>>>> is only for
>>>> >> >> > >> > > platform
>>>> >> >> > >> > > > > > > users.
>>>> >> >> > >> > > > > > > >> >>> In
>>>> >> >> > >> > > > > > > >> >>>>> our
>>>> >> >> > >> > > > > > > >> >>>>>>>> internal deploy service,
>>>> >> >> > >> > > > > > > >> >>>>>>>> we further optimized the deployment
>>>> latency by
>>>> >> >> > >> letting
>>>> >> >> > >> > > > > users
>>>> >> >> > >> > > > > > to
>>>> >> >> > >> > > > > > > >> >>>>>> packaging
>>>> >> >> > >> > > > > > > >> >>>>>>>> flink-runtime together with the uber
>>>> jar, so that
>>>> >> >> > >> we
>>>> >> >> > >> > > > don't
>>>> >> >> > >> > > > > > need
>>>> >> >> > >> > > > > > > >> to
>>>> >> >> > >> > > > > > > >> >>>>>>>> consider
>>>> >> >> > >> > > > > > > >> >>>>>>>> multiple flink version
>>>> >> >> > >> > > > > > > >> >>>>>>>> support for now. In the session
>>>> client mode, as
>>>> >> >> > >> Flink
>>>> >> >> > >> > > > libs
>>>> >> >> > >> > > > > > will
>>>> >> >> > >> > > > > > > >> be
>>>> >> >> > >> > > > > > > >> >>>>>> shipped
>>>> >> >> > >> > > > > > > >> >>>>>>>> anyway as local resources of yarn.
>>>> Users actually
>>>> >> >> > >> don't
>>>> >> >> > >> > > > > need
>>>> >> >> > >> > > > > > to
>>>> >> >> > >> > > > > > > >> >>>>> package
>>>> >> >> > >> > > > > > > >> >>>>>>>> those libs into job jar.
>>>> >> >> > >> > > > > > > >> >>>>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>>> Best Regards
>>>> >> >> > >> > > > > > > >> >>>>>>>> Peter Huang
>>>> >> >> > >> > > > > > > >> >>>>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>>> On Mon, Dec 9, 2019 at 8:35 PM tison
>>>> <
>>>> >> >> > >> > > > [hidden email]
>>>> >> >> > >> > > > > >
>>>> >> >> > >> > > > > > > >> >>> wrote:
>>>> >> >> > >> > > > > > > >> >>>>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>>>>> 3. What do you mean about the
>>>> package? Do users
>>>> >> >> > >> need
>>>> >> >> > >> > > to
>>>> >> >> > >> > > > > > > >> >>> compile
>>>> >> >> > >> > > > > > > >> >>>>>> their
>>>> >> >> > >> > > > > > > >> >>>>>>>>> jars
>>>> >> >> > >> > > > > > > >> >>>>>>>>> inlcuding flink-clients,
>>>> flink-optimizer,
>>>> >> >> > >> flink-table
>>>> >> >> > >> > > > > codes?
>>>> >> >> > >> > > > > > > >> >>>>>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>>>> The answer should be no because
>>>> they exist in
>>>> >> >> > >> system
>>>> >> >> > >> > > > > > > classpath.
>>>> >> >> > >> > > > > > > >> >>>>>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>>>> Best,
>>>> >> >> > >> > > > > > > >> >>>>>>>>> tison.
>>>> >> >> > >> > > > > > > >> >>>>>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>>>> Yang Wang <[hidden email]>
>>>> 于2019年12月10日周二
>>>> >> >> > >> > > > > 下午12:18写道:
>>>> >> >> > >> > > > > > > >> >>>>>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>>>>> Hi Peter,
>>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>>>>> Thanks a lot for starting this
>>>> discussion. I
>>>> >> >> > >> think
>>>> >> >> > >> > > this
>>>> >> >> > >> > > > > is
>>>> >> >> > >> > > > > > a
>>>> >> >> > >> > > > > > > >> >>> very
>>>> >> >> > >> > > > > > > >> >>>>>>>> useful
>>>> >> >> > >> > > > > > > >> >>>>>>>>>> feature.
>>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>>>>> Not only for Yarn, i am focused on
>>>> flink on
>>>> >> >> > >> > > Kubernetes
>>>> >> >> > >> > > > > > > >> >>>>> integration
>>>> >> >> > >> > > > > > > >> >>>>>> and
>>>> >> >> > >> > > > > > > >> >>>>>>>>> come
>>>> >> >> > >> > > > > > > >> >>>>>>>>>> across the same
>>>> >> >> > >> > > > > > > >> >>>>>>>>>> problem. I do not want the job
>>>> graph generated
>>>> >> >> > >> on
>>>> >> >> > >> > > > client
>>>> >> >> > >> > > > > > > side.
>>>> >> >> > >> > > > > > > >> >>>>>>>> Instead,
>>>> >> >> > >> > > > > > > >> >>>>>>>>> the
>>>> >> >> > >> > > > > > > >> >>>>>>>>>> user jars are built in
>>>> >> >> > >> > > > > > > >> >>>>>>>>>> a user-defined image. When the job
>>>> manager
>>>> >> >> > >> launched,
>>>> >> >> > >> > > we
>>>> >> >> > >> > > > > > just
>>>> >> >> > >> > > > > > > >> >>>>> need to
>>>> >> >> > >> > > > > > > >> >>>>>>>>>> generate the job graph
>>>> >> >> > >> > > > > > > >> >>>>>>>>>> based on local user jars.
>>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>>>>> I have some small suggestion about
>>>> this.
>>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>>>>> 1. `ProgramJobGraphRetriever` is
>>>> very similar to
>>>> >> >> > >> > > > > > > >> >>>>>>>>>> `ClasspathJobGraphRetriever`, the
>>>> differences
>>>> >> >> > >> > > > > > > >> >>>>>>>>>> are the former needs
>>>> `ProgramMetadata` and the
>>>> >> >> > >> latter
>>>> >> >> > >> > > > > needs
>>>> >> >> > >> > > > > > > >> >>> some
>>>> >> >> > >> > > > > > > >> >>>>>>>>> arguments.
>>>> >> >> > >> > > > > > > >> >>>>>>>>>> Is it possible to
>>>> >> >> > >> > > > > > > >> >>>>>>>>>> have an unified
>>>> `JobGraphRetriever` to support
>>>> >> >> > >> both?
>>>> >> >> > >> > > > > > > >> >>>>>>>>>> 2. Is it possible to not use a
>>>> local user jar to
>>>> >> >> > >> > > start
>>>> >> >> > >> > > > a
>>>> >> >> > >> > > > > > > >> >>> per-job
>>>> >> >> > >> > > > > > > >> >>>>>>>> cluster?
>>>> >> >> > >> > > > > > > >> >>>>>>>>>> In your case, the user jars has
>>>> >> >> > >> > > > > > > >> >>>>>>>>>> existed on hdfs already and we do
>>>> need to
>>>> >> >> > >> download
>>>> >> >> > >> > > the
>>>> >> >> > >> > > > > jars
>>>> >> >> > >> > > > > > > to
>>>> >> >> > >> > > > > > > >> >>>>>>>> deployer
>>>> >> >> > >> > > > > > > >> >>>>>>>>>> service. Currently, we
>>>> >> >> > >> > > > > > > >> >>>>>>>>>> always need a local user jar to
>>>> start a flink
>>>> >> >> > >> > > cluster.
>>>> >> >> > >> > > > It
>>>> >> >> > >> > > > > > is
>>>> >> >> > >> > > > > > > >> >>> be
>>>> >> >> > >> > > > > > > >> >>>>>> great
>>>> >> >> > >> > > > > > > >> >>>>>>>> if
>>>> >> >> > >> > > > > > > >> >>>>>>>>> we
>>>> >> >> > >> > > > > > > >> >>>>>>>>>> could support remote user jars.
>>>> >> >> > >> > > > > > > >> >>>>>>>>>>>> In the implementation, we assume
>>>> users package
>>>> >> >> > >> > > > > > > >> >>> flink-clients,
>>>> >> >> > >> > > > > > > >> >>>>>>>>>> flink-optimizer, flink-table
>>>> together within
>>>> >> >> > >> the job
>>>> >> >> > >> > > > jar.
>>>> >> >> > >> > > > > > > >> >>>>> Otherwise,
>>>> >> >> > >> > > > > > > >> >>>>>>>> the
>>>> >> >> > >> > > > > > > >> >>>>>>>>>> job graph generation within
>>>> >> >> > >> JobClusterEntryPoint will
>>>> >> >> > >> > > > > fail.
>>>> >> >> > >> > > > > > > >> >>>>>>>>>> 3. What do you mean about the
>>>> package? Do users
>>>> >> >> > >> need
>>>> >> >> > >> > > to
>>>> >> >> > >> > > > > > > >> >>> compile
>>>> >> >> > >> > > > > > > >> >>>>>> their
>>>> >> >> > >> > > > > > > >> >>>>>>>>> jars
>>>> >> >> > >> > > > > > > >> >>>>>>>>>> inlcuding flink-clients,
>>>> flink-optimizer,
>>>> >> >> > >> flink-table
>>>> >> >> > >> > > > > > codes?
>>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>>>>> Best,
>>>> >> >> > >> > > > > > > >> >>>>>>>>>> Yang
>>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>>>>> Peter Huang <
>>>> [hidden email]>
>>>> >> >> > >> > > > 于2019年12月10日周二
>>>> >> >> > >> > > > > > > >> >>>>> 上午2:37写道:
>>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> Dear All,
>>>> >> >> > >> > > > > > > >> >>>>>>>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> Recently, the Flink community
>>>> starts to
>>>> >> >> > >> improve the
>>>> >> >> > >> > > > yarn
>>>> >> >> > >> > > > > > > >> >>>>> cluster
>>>> >> >> > >> > > > > > > >> >>>>>>>>>> descriptor
>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> to make job jar and config files
>>>> configurable
>>>> >> >> > >> from
>>>> >> >> > >> > > > CLI.
>>>> >> >> > >> > > > > It
>>>> >> >> > >> > > > > > > >> >>>>>> improves
>>>> >> >> > >> > > > > > > >> >>>>>>>> the
>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> flexibility of  Flink deployment
>>>> Yarn Per Job
>>>> >> >> > >> Mode.
>>>> >> >> > >> > > > For
>>>> >> >> > >> > > > > > > >> >>>>> platform
>>>> >> >> > >> > > > > > > >> >>>>>>>> users
>>>> >> >> > >> > > > > > > >> >>>>>>>>>> who
>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> manage tens of hundreds of
>>>> streaming pipelines
>>>> >> >> > >> for
>>>> >> >> > >> > > the
>>>> >> >> > >> > > > > > whole
>>>> >> >> > >> > > > > > > >> >>>>> org
>>>> >> >> > >> > > > > > > >> >>>>>> or
>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> company, we found the job graph
>>>> generation in
>>>> >> >> > >> > > > > client-side
>>>> >> >> > >> > > > > > is
>>>> >> >> > >> > > > > > > >> >>>>>> another
>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> pinpoint. Thus, we want to
>>>> propose a
>>>> >> >> > >> configurable
>>>> >> >> > >> > > > > feature
>>>> >> >> > >> > > > > > > >> >>> for
>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> FlinkYarnSessionCli. The feature
>>>> can allow
>>>> >> >> > >> users to
>>>> >> >> > >> > > > > choose
>>>> >> >> > >> > > > > > > >> >>> the
>>>> >> >> > >> > > > > > > >> >>>>> job
>>>> >> >> > >> > > > > > > >> >>>>>>>>> graph
>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> generation in Flink
>>>> ClusterEntryPoint so that
>>>> >> >> > >> the
>>>> >> >> > >> > > job
>>>> >> >> > >> > > > > jar
>>>> >> >> > >> > > > > > > >> >>>>> doesn't
>>>> >> >> > >> > > > > > > >> >>>>>>>> need
>>>> >> >> > >> > > > > > > >> >>>>>>>>> to
>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> be locally for the job graph
>>>> generation. The
>>>> >> >> > >> > > proposal
>>>> >> >> > >> > > > is
>>>> >> >> > >> > > > > > > >> >>>>> organized
>>>> >> >> > >> > > > > > > >> >>>>>>>> as a
>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> FLIP
>>>> >> >> > >> > > > > > > >> >>>>>>>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>
>>>> >> >> > >> > > > > > > >> >>>
>>>> >> >> > >> > > > > > > >>
>>>> >> >> > >> > > > > > >
>>>> >> >> > >> > > > > >
>>>> >> >> > >> > > > >
>>>> >> >> > >> > > >
>>>> >> >> > >> > >
>>>> >> >> > >>
>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-85+Delayed+JobGraph+Generation
>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> .
>>>> >> >> > >> > > > > > > >> >>>>>>>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> Any questions and suggestions are
>>>> welcomed.
>>>> >> >> > >> Thank
>>>> >> >> > >> > > you
>>>> >> >> > >> > > > in
>>>> >> >> > >> > > > > > > >> >>>>> advance.
>>>> >> >> > >> > > > > > > >> >>>>>>>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> Best Regards
>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> Peter Huang
>>>> >> >> > >> > > > > > > >> >>>>>>>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>>
>>>> >> >> > >> > > > > > > >> >>>>>
>>>> >> >> > >> > > > > > > >> >>>>
>>>> >> >> > >> > > > > > > >> >>>
>>>> >> >> > >> > > > > > > >> >>
>>>> >> >> > >> > > > > > > >>
>>>> >> >> > >> > > > > > > >>
>>>> >> >> > >> > > > > > >
>>>> >> >> > >> > > > > >
>>>> >> >> > >> > > > >
>>>> >> >> > >> > > >
>>>> >> >> > >> > >
>>>> >> >> > >>
>>>> >> >> > >
>>>> >> >>
>>>>
>>>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-85: Delayed Job Graph Generation

Kostas Kloudas-4
Also from my side +1  to start voting.

Cheers,
Kostas

On Thu, Mar 5, 2020 at 7:45 AM tison <[hidden email]> wrote:

>
> +1 to star voting.
>
> Best,
> tison.
>
>
> Yang Wang <[hidden email]> 于2020年3月5日周四 下午2:29写道:
>>
>> Hi Peter,
>> Really thanks for your response.
>>
>> Hi all @Kostas Kloudas @Zili Chen @Peter Huang @Rong Rong
>> It seems that we have reached an agreement. The “application mode” is regarded as the enhanced “per-job”. It is
>> orthogonal with “cluster deploy”. Currently, we bind the “per-job” to `run-user-main-on-client` and “application mode”
>> to `run-user-main-on-cluster`.
>>
>> Do you have other concerns to moving FLIP-85 to voting?
>>
>>
>> Best,
>> Yang
>>
>> Peter Huang <[hidden email]> 于2020年3月5日周四 下午12:48写道:
>>>
>>> Hi Yang and Kostas,
>>>
>>> Thanks for the clarification. It makes more sense to me if the long term goal is to replace per job mode to application mode
>>>  in the future (at the time that multiple execute can be supported). Before that, It will be better to keep the concept of
>>>  application mode internally. As Yang suggested, User only need to use a `-R/-- remote-deploy` cli option to launch
>>> a per job cluster with the main function executed in cluster entry-point.  +1 for the execution plan.
>>>
>>>
>>>
>>> Best Regards
>>> Peter Huang
>>>
>>>
>>>
>>>
>>> On Tue, Mar 3, 2020 at 7:11 AM Yang Wang <[hidden email]> wrote:
>>>>
>>>> Hi Peter,
>>>>
>>>> Having the application mode does not mean we will drop the cluster-deploy
>>>> option. I just want to share some thoughts about “Application Mode”.
>>>>
>>>>
>>>> 1. The application mode could cover the per-job sematic. Its lifecyle is bound
>>>> to the user `main()`. And all the jobs in the user main will be executed in a same
>>>> Flink cluster. In first phase of FLIP-85 implementation, running user main on the
>>>> cluster side could be supported in application mode.
>>>>
>>>> 2. Maybe in the future, we also need to support multiple `execute()` on client side
>>>> in a same Flink cluster. Then the per-job mode will evolve to application mode.
>>>>
>>>> 3. From user perspective, only a `-R/-- remote-deploy` cli option is visible. They
>>>> are not aware of the application mode.
>>>>
>>>> 4. In the first phase, the application mode is working as “per-job”(only one job in
>>>> the user main). We just leave more potential for the future.
>>>>
>>>>
>>>> I am not against with calling it “cluster deploy mode” if you all think it is clearer for users.
>>>>
>>>>
>>>>
>>>> Best,
>>>> Yang
>>>>
>>>> Kostas Kloudas <[hidden email]> 于2020年3月3日周二 下午6:49写道:
>>>>>
>>>>> Hi Peter,
>>>>>
>>>>> I understand your point. This is why I was also a bit torn about the
>>>>> name and my proposal was a bit aligned with yours (something along the
>>>>> lines of "cluster deploy" mode).
>>>>>
>>>>> But many of the other participants in the discussion suggested the
>>>>> "Application Mode". I think that the reasoning is that now the user's
>>>>> Application is more self-contained.
>>>>> It will be submitted to the cluster and the user can just disconnect.
>>>>> In addition, as discussed briefly in the doc, in the future there may
>>>>> be better support for multi-execute applications which will bring us
>>>>> one step closer to the true "Application Mode". But this is how I
>>>>> interpreted their arguments, of course they can also express their
>>>>> thoughts on the topic :)
>>>>>
>>>>> Cheers,
>>>>> Kostas
>>>>>
>>>>> On Mon, Mar 2, 2020 at 6:15 PM Peter Huang <[hidden email]> wrote:
>>>>> >
>>>>> > Hi Kostas,
>>>>> >
>>>>> > Thanks for updating the wiki. We have aligned with the implementations in the doc. But I feel it is still a little bit confusing of the naming from a user's perspective. It is well known that Flink support per job cluster and session cluster. The concept is in the layer of how a job is managed within Flink. The method introduced util now is a kind of mixing job and session cluster to promising the implementation complexity. We probably don't need to label it as Application Model as the same layer of per job cluster and session cluster. Conceptually, I think it is still a cluster mode implementation for per job cluster.
>>>>> >
>>>>> > To minimize the confusion of users, I think it would be better just an option of per job cluster for each type of cluster manager. How do you think?
>>>>> >
>>>>> >
>>>>> > Best Regards
>>>>> > Peter Huang
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> >
>>>>> > On Mon, Mar 2, 2020 at 7:22 AM Kostas Kloudas <[hidden email]> wrote:
>>>>> >>
>>>>> >> Hi Yang,
>>>>> >>
>>>>> >> The difference between per-job and application mode is that, as you
>>>>> >> described, in the per-job mode the main is executed on the client
>>>>> >> while in the application mode, the main is executed on the cluster.
>>>>> >> I do not think we have to offer "application mode" with running the
>>>>> >> main on the client side as this is exactly what the per-job mode does
>>>>> >> currently and, as you described also, it would be redundant.
>>>>> >>
>>>>> >> Sorry if this was not clear in the document.
>>>>> >>
>>>>> >> Cheers,
>>>>> >> Kostas
>>>>> >>
>>>>> >> On Mon, Mar 2, 2020 at 3:17 PM Yang Wang <[hidden email]> wrote:
>>>>> >> >
>>>>> >> > Hi Kostas,
>>>>> >> >
>>>>> >> > Thanks a lot for your conclusion and updating the FLIP-85 WIKI. Currently, i have no more
>>>>> >> > questions about motivation, approach, fault tolerance and the first phase implementation.
>>>>> >> >
>>>>> >> > I think the new title "Flink Application Mode" makes a lot senses to me. Especially for the
>>>>> >> > containerized environment, the cluster deploy option will be very useful.
>>>>> >> >
>>>>> >> > Just one concern, how do we introduce this new application mode to our users?
>>>>> >> > Each user program(i.e. `main()`) is an application. Currently, we intend to only support one
>>>>> >> > `execute()`. So what's the difference between per-job and application mode?
>>>>> >> >
>>>>> >> > For per-job, user `main()` is always executed on client side. And For application mode, user
>>>>> >> > `main()` could be executed on client or master side(configured via cli option).
>>>>> >> > Right? We need to have a clear concept. Otherwise, the users will be more and more confusing.
>>>>> >> >
>>>>> >> >
>>>>> >> > Best,
>>>>> >> > Yang
>>>>> >> >
>>>>> >> > Kostas Kloudas <[hidden email]> 于2020年3月2日周一 下午5:58写道:
>>>>> >> >>
>>>>> >> >> Hi all,
>>>>> >> >>
>>>>> >> >> I update https://cwiki.apache.org/confluence/display/FLINK/FLIP-85+Flink+Application+Mode
>>>>> >> >> based on the discussion we had here:
>>>>> >> >>
>>>>> >> >> https://docs.google.com/document/d/1ji72s3FD9DYUyGuKnJoO4ApzV-nSsZa0-bceGXW7Ocw/edit#
>>>>> >> >>
>>>>> >> >> Please let me know what you think and please keep the discussion in the ML :)
>>>>> >> >>
>>>>> >> >> Thanks for starting the discussion and I hope that soon we will be
>>>>> >> >> able to vote on the FLIP.
>>>>> >> >>
>>>>> >> >> Cheers,
>>>>> >> >> Kostas
>>>>> >> >>
>>>>> >> >> On Thu, Jan 16, 2020 at 3:40 AM Yang Wang <[hidden email]> wrote:
>>>>> >> >> >
>>>>> >> >> > Hi all,
>>>>> >> >> >
>>>>> >> >> > Thanks a lot for the feedback from @Kostas Kloudas. Your all concerns are
>>>>> >> >> > on point. The FLIP-85 is mainly
>>>>> >> >> > focused on supporting cluster mode for per-job. Since it is more urgent and
>>>>> >> >> > have much more use
>>>>> >> >> > cases both in Yarn and Kubernetes deployment. For session cluster, we could
>>>>> >> >> > have more discussion
>>>>> >> >> > in a new thread later.
>>>>> >> >> >
>>>>> >> >> > #1, How to download the user jars and dependencies for per-job in cluster
>>>>> >> >> > mode?
>>>>> >> >> > For Yarn, we could register the user jars and dependencies as
>>>>> >> >> > LocalResource. They will be distributed
>>>>> >> >> > by Yarn. And once the JobManager and TaskManager launched, the jars are
>>>>> >> >> > already exists.
>>>>> >> >> > For Standalone per-job and K8s, we expect that the user jars
>>>>> >> >> > and dependencies are built into the image.
>>>>> >> >> > Or the InitContainer could be used for downloading. It is natively
>>>>> >> >> > distributed and we will not have bottleneck.
>>>>> >> >> >
>>>>> >> >> > #2, Job graph recovery
>>>>> >> >> > We could have an optimization to store job graph on the DFS. However, i
>>>>> >> >> > suggest building a new jobgraph
>>>>> >> >> > from the configuration is the default option. Since we will not always have
>>>>> >> >> > a DFS store when deploying a
>>>>> >> >> > Flink per-job cluster. Of course, we assume that using the same
>>>>> >> >> > configuration(e.g. job_id, user_jar, main_class,
>>>>> >> >> > main_args, parallelism, savepoint_settings, etc.) will get a same job
>>>>> >> >> > graph. I think the standalone per-job
>>>>> >> >> > already has the similar behavior.
>>>>> >> >> >
>>>>> >> >> > #3, What happens with jobs that have multiple execute calls?
>>>>> >> >> > Currently, it is really a problem. Even we use a local client on Flink
>>>>> >> >> > master side, it will have different behavior with
>>>>> >> >> > client mode. For client mode, if we execute multiple times, then we will
>>>>> >> >> > deploy multiple Flink clusters for each execute.
>>>>> >> >> > I am not pretty sure whether it is reasonable. However, i still think using
>>>>> >> >> > the local client is a good choice. We could
>>>>> >> >> > continue the discussion in a new thread. @Zili Chen <[hidden email]> Do
>>>>> >> >> > you want to drive this?
>>>>> >> >> >
>>>>> >> >> >
>>>>> >> >> >
>>>>> >> >> > Best,
>>>>> >> >> > Yang
>>>>> >> >> >
>>>>> >> >> > Peter Huang <[hidden email]> 于2020年1月16日周四 上午1:55写道:
>>>>> >> >> >
>>>>> >> >> > > Hi Kostas,
>>>>> >> >> > >
>>>>> >> >> > > Thanks for this feedback. I can't agree more about the opinion. The
>>>>> >> >> > > cluster mode should be added
>>>>> >> >> > > first in per job cluster.
>>>>> >> >> > >
>>>>> >> >> > > 1) For job cluster implementation
>>>>> >> >> > > 1. Job graph recovery from configuration or store as static job graph as
>>>>> >> >> > > session cluster. I think the static one will be better for less recovery
>>>>> >> >> > > time.
>>>>> >> >> > > Let me update the doc for details.
>>>>> >> >> > >
>>>>> >> >> > > 2. For job execute multiple times, I think @Zili Chen
>>>>> >> >> > > <[hidden email]> has proposed the local client solution that can
>>>>> >> >> > > the run program actually in the cluster entry point. We can put the
>>>>> >> >> > > implementation in the second stage,
>>>>> >> >> > > or even a new FLIP for further discussion.
>>>>> >> >> > >
>>>>> >> >> > > 2) For session cluster implementation
>>>>> >> >> > > We can disable the cluster mode for the session cluster in the first
>>>>> >> >> > > stage. I agree the jar downloading will be a painful thing.
>>>>> >> >> > > We can consider about PoC and performance evaluation first. If the end to
>>>>> >> >> > > end experience is good enough, then we can consider
>>>>> >> >> > > proceeding with the solution.
>>>>> >> >> > >
>>>>> >> >> > > Looking forward to more opinions from @Yang Wang <[hidden email]> @Zili
>>>>> >> >> > > Chen <[hidden email]> @Dian Fu <[hidden email]>.
>>>>> >> >> > >
>>>>> >> >> > >
>>>>> >> >> > > Best Regards
>>>>> >> >> > > Peter Huang
>>>>> >> >> > >
>>>>> >> >> > > On Wed, Jan 15, 2020 at 7:50 AM Kostas Kloudas <[hidden email]> wrote:
>>>>> >> >> > >
>>>>> >> >> > >> Hi all,
>>>>> >> >> > >>
>>>>> >> >> > >> I am writing here as the discussion on the Google Doc seems to be a
>>>>> >> >> > >> bit difficult to follow.
>>>>> >> >> > >>
>>>>> >> >> > >> I think that in order to be able to make progress, it would be helpful
>>>>> >> >> > >> to focus on per-job mode for now.
>>>>> >> >> > >> The reason is that:
>>>>> >> >> > >>  1) making the (unique) JobSubmitHandler responsible for creating the
>>>>> >> >> > >> jobgraphs,
>>>>> >> >> > >>   which includes downloading dependencies, is not an optimal solution
>>>>> >> >> > >>  2) even if we put the responsibility on the JobMaster, currently each
>>>>> >> >> > >> job has its own
>>>>> >> >> > >>   JobMaster but they all run on the same process, so we have again a
>>>>> >> >> > >> single entity.
>>>>> >> >> > >>
>>>>> >> >> > >> Of course after this is done, and if we feel comfortable with the
>>>>> >> >> > >> solution, then we can go to the session mode.
>>>>> >> >> > >>
>>>>> >> >> > >> A second comment has to do with fault-tolerance in the per-job,
>>>>> >> >> > >> cluster-deploy mode.
>>>>> >> >> > >> In the document, it is suggested that upon recovery, the JobMaster of
>>>>> >> >> > >> each job re-creates the JobGraph.
>>>>> >> >> > >> I am just wondering if it is better to create and store the jobGraph
>>>>> >> >> > >> upon submission and only fetch it
>>>>> >> >> > >> upon recovery so that we have a static jobGraph.
>>>>> >> >> > >>
>>>>> >> >> > >> Finally, I have a question which is what happens with jobs that have
>>>>> >> >> > >> multiple execute calls?
>>>>> >> >> > >> The semantics seem to change compared to the current behaviour, right?
>>>>> >> >> > >>
>>>>> >> >> > >> Cheers,
>>>>> >> >> > >> Kostas
>>>>> >> >> > >>
>>>>> >> >> > >> On Wed, Jan 8, 2020 at 8:05 PM tison <[hidden email]> wrote:
>>>>> >> >> > >> >
>>>>> >> >> > >> > not always, Yang Wang is also not yet a committer but he can join the
>>>>> >> >> > >> > channel. I cannot find the id by clicking “Add new member in channel” so
>>>>> >> >> > >> > come to you and ask for try out the link. Possibly I will find other
>>>>> >> >> > >> ways
>>>>> >> >> > >> > but the original purpose is that the slack channel is a public area we
>>>>> >> >> > >> > discuss about developing...
>>>>> >> >> > >> > Best,
>>>>> >> >> > >> > tison.
>>>>> >> >> > >> >
>>>>> >> >> > >> >
>>>>> >> >> > >> > Peter Huang <[hidden email]> 于2020年1月9日周四 上午2:44写道:
>>>>> >> >> > >> >
>>>>> >> >> > >> > > Hi Tison,
>>>>> >> >> > >> > >
>>>>> >> >> > >> > > I am not the committer of Flink yet. I think I can't join it also.
>>>>> >> >> > >> > >
>>>>> >> >> > >> > >
>>>>> >> >> > >> > > Best Regards
>>>>> >> >> > >> > > Peter Huang
>>>>> >> >> > >> > >
>>>>> >> >> > >> > > On Wed, Jan 8, 2020 at 9:39 AM tison <[hidden email]> wrote:
>>>>> >> >> > >> > >
>>>>> >> >> > >> > > > Hi Peter,
>>>>> >> >> > >> > > >
>>>>> >> >> > >> > > > Could you try out this link?
>>>>> >> >> > >> > > https://the-asf.slack.com/messages/CNA3ADZPH
>>>>> >> >> > >> > > >
>>>>> >> >> > >> > > > Best,
>>>>> >> >> > >> > > > tison.
>>>>> >> >> > >> > > >
>>>>> >> >> > >> > > >
>>>>> >> >> > >> > > > Peter Huang <[hidden email]> 于2020年1月9日周四 上午1:22写道:
>>>>> >> >> > >> > > >
>>>>> >> >> > >> > > > > Hi Tison,
>>>>> >> >> > >> > > > >
>>>>> >> >> > >> > > > > I can't join the group with shared link. Would you please add me
>>>>> >> >> > >> into
>>>>> >> >> > >> > > the
>>>>> >> >> > >> > > > > group? My slack account is huangzhenqiu0825.
>>>>> >> >> > >> > > > > Thank you in advance.
>>>>> >> >> > >> > > > >
>>>>> >> >> > >> > > > >
>>>>> >> >> > >> > > > > Best Regards
>>>>> >> >> > >> > > > > Peter Huang
>>>>> >> >> > >> > > > >
>>>>> >> >> > >> > > > > On Wed, Jan 8, 2020 at 12:02 AM tison <[hidden email]>
>>>>> >> >> > >> wrote:
>>>>> >> >> > >> > > > >
>>>>> >> >> > >> > > > > > Hi Peter,
>>>>> >> >> > >> > > > > >
>>>>> >> >> > >> > > > > > As described above, this effort should get attention from people
>>>>> >> >> > >> > > > > developing
>>>>> >> >> > >> > > > > > FLIP-73 a.k.a. Executor abstractions. I recommend you to join
>>>>> >> >> > >> the
>>>>> >> >> > >> > > > public
>>>>> >> >> > >> > > > > > slack channel[1] for Flink Client API Enhancement and you can
>>>>> >> >> > >> try to
>>>>> >> >> > >> > > > > share
>>>>> >> >> > >> > > > > > you detailed thoughts there. It possibly gets more concrete
>>>>> >> >> > >> > > attentions.
>>>>> >> >> > >> > > > > >
>>>>> >> >> > >> > > > > > Best,
>>>>> >> >> > >> > > > > > tison.
>>>>> >> >> > >> > > > > >
>>>>> >> >> > >> > > > > > [1]
>>>>> >> >> > >> > > > > >
>>>>> >> >> > >> > > > > >
>>>>> >> >> > >> > > > >
>>>>> >> >> > >> > > >
>>>>> >> >> > >> > >
>>>>> >> >> > >> https://slack.com/share/IS21SJ75H/Rk8HhUly9FuEHb7oGwBZ33uL/enQtODg2MDYwNjE5MTg3LTA2MjIzNDc1M2ZjZDVlMjdlZjk1M2RkYmJhNjAwMTk2ZDZkODQ4NmY5YmI4OGRhNWJkYTViMTM1NzlmMzc4OWM
>>>>> >> >> > >> > > > > >
>>>>> >> >> > >> > > > > >
>>>>> >> >> > >> > > > > > Peter Huang <[hidden email]> 于2020年1月7日周二 上午5:09写道:
>>>>> >> >> > >> > > > > >
>>>>> >> >> > >> > > > > > > Dear All,
>>>>> >> >> > >> > > > > > >
>>>>> >> >> > >> > > > > > > Happy new year! According to existing feedback from the
>>>>> >> >> > >> community,
>>>>> >> >> > >> > > we
>>>>> >> >> > >> > > > > > > revised the doc with the consideration of session cluster
>>>>> >> >> > >> support,
>>>>> >> >> > >> > > > and
>>>>> >> >> > >> > > > > > > concrete interface changes needed and execution plan. Please
>>>>> >> >> > >> take
>>>>> >> >> > >> > > one
>>>>> >> >> > >> > > > > > more
>>>>> >> >> > >> > > > > > > round of review at your most convenient time.
>>>>> >> >> > >> > > > > > >
>>>>> >> >> > >> > > > > > >
>>>>> >> >> > >> > > > > > >
>>>>> >> >> > >> > > > > >
>>>>> >> >> > >> > > > >
>>>>> >> >> > >> > > >
>>>>> >> >> > >> > >
>>>>> >> >> > >> https://docs.google.com/document/d/1aAwVjdZByA-0CHbgv16Me-vjaaDMCfhX7TzVVTuifYM/edit#
>>>>> >> >> > >> > > > > > >
>>>>> >> >> > >> > > > > > >
>>>>> >> >> > >> > > > > > > Best Regards
>>>>> >> >> > >> > > > > > > Peter Huang
>>>>> >> >> > >> > > > > > >
>>>>> >> >> > >> > > > > > >
>>>>> >> >> > >> > > > > > >
>>>>> >> >> > >> > > > > > >
>>>>> >> >> > >> > > > > > >
>>>>> >> >> > >> > > > > > > On Thu, Jan 2, 2020 at 11:29 AM Peter Huang <
>>>>> >> >> > >> > > > > [hidden email]>
>>>>> >> >> > >> > > > > > > wrote:
>>>>> >> >> > >> > > > > > >
>>>>> >> >> > >> > > > > > > > Hi Dian,
>>>>> >> >> > >> > > > > > > > Thanks for giving us valuable feedbacks.
>>>>> >> >> > >> > > > > > > >
>>>>> >> >> > >> > > > > > > > 1) It's better to have a whole design for this feature
>>>>> >> >> > >> > > > > > > > For the suggestion of enabling the cluster mode also session
>>>>> >> >> > >> > > > > cluster, I
>>>>> >> >> > >> > > > > > > > think Flink already supported it. WebSubmissionExtension
>>>>> >> >> > >> already
>>>>> >> >> > >> > > > > allows
>>>>> >> >> > >> > > > > > > > users to start a job with the specified jar by using web UI.
>>>>> >> >> > >> > > > > > > > But we need to enable the feature from CLI for both local
>>>>> >> >> > >> jar,
>>>>> >> >> > >> > > > remote
>>>>> >> >> > >> > > > > > > jar.
>>>>> >> >> > >> > > > > > > > I will align with Yang Wang first about the details and
>>>>> >> >> > >> update
>>>>> >> >> > >> > > the
>>>>> >> >> > >> > > > > > design
>>>>> >> >> > >> > > > > > > > doc.
>>>>> >> >> > >> > > > > > > >
>>>>> >> >> > >> > > > > > > > 2) It's better to consider the convenience for users, such
>>>>> >> >> > >> as
>>>>> >> >> > >> > > > > debugging
>>>>> >> >> > >> > > > > > > >
>>>>> >> >> > >> > > > > > > > I am wondering whether we can store the exception in
>>>>> >> >> > >> jobgragh
>>>>> >> >> > >> > > > > > > > generation in application master. As no streaming graph can
>>>>> >> >> > >> be
>>>>> >> >> > >> > > > > > scheduled
>>>>> >> >> > >> > > > > > > in
>>>>> >> >> > >> > > > > > > > this case, there will be no more TM will be requested from
>>>>> >> >> > >> > > FlinkRM.
>>>>> >> >> > >> > > > > > > > If the AM is still running, users can still query it from
>>>>> >> >> > >> CLI. As
>>>>> >> >> > >> > > > it
>>>>> >> >> > >> > > > > > > > requires more change, we can get some feedback from <
>>>>> >> >> > >> > > > > > [hidden email]
>>>>> >> >> > >> > > > > > > >
>>>>> >> >> > >> > > > > > > > and @[hidden email] <[hidden email]>.
>>>>> >> >> > >> > > > > > > >
>>>>> >> >> > >> > > > > > > > 3) It's better to consider the impact to the stability of
>>>>> >> >> > >> the
>>>>> >> >> > >> > > > cluster
>>>>> >> >> > >> > > > > > > >
>>>>> >> >> > >> > > > > > > > I agree with Yang Wang's opinion.
>>>>> >> >> > >> > > > > > > >
>>>>> >> >> > >> > > > > > > >
>>>>> >> >> > >> > > > > > > >
>>>>> >> >> > >> > > > > > > > Best Regards
>>>>> >> >> > >> > > > > > > > Peter Huang
>>>>> >> >> > >> > > > > > > >
>>>>> >> >> > >> > > > > > > >
>>>>> >> >> > >> > > > > > > > On Sun, Dec 29, 2019 at 9:44 PM Dian Fu <
>>>>> >> >> > >> [hidden email]>
>>>>> >> >> > >> > > > > wrote:
>>>>> >> >> > >> > > > > > > >
>>>>> >> >> > >> > > > > > > >> Hi all,
>>>>> >> >> > >> > > > > > > >>
>>>>> >> >> > >> > > > > > > >> Sorry to jump into this discussion. Thanks everyone for the
>>>>> >> >> > >> > > > > > discussion.
>>>>> >> >> > >> > > > > > > >> I'm very interested in this topic although I'm not an
>>>>> >> >> > >> expert in
>>>>> >> >> > >> > > > this
>>>>> >> >> > >> > > > > > > part.
>>>>> >> >> > >> > > > > > > >> So I'm glad to share my thoughts as following:
>>>>> >> >> > >> > > > > > > >>
>>>>> >> >> > >> > > > > > > >> 1) It's better to have a whole design for this feature
>>>>> >> >> > >> > > > > > > >> As we know, there are two deployment modes: per-job mode
>>>>> >> >> > >> and
>>>>> >> >> > >> > > > session
>>>>> >> >> > >> > > > > > > >> mode. I'm wondering which mode really needs this feature.
>>>>> >> >> > >> As the
>>>>> >> >> > >> > > > > > design
>>>>> >> >> > >> > > > > > > doc
>>>>> >> >> > >> > > > > > > >> mentioned, per-job mode is more used for streaming jobs and
>>>>> >> >> > >> > > > session
>>>>> >> >> > >> > > > > > > mode is
>>>>> >> >> > >> > > > > > > >> usually used for batch jobs(Of course, the job types and
>>>>> >> >> > >> the
>>>>> >> >> > >> > > > > > deployment
>>>>> >> >> > >> > > > > > > >> modes are orthogonal). Usually streaming job is only
>>>>> >> >> > >> needed to
>>>>> >> >> > >> > > be
>>>>> >> >> > >> > > > > > > submitted
>>>>> >> >> > >> > > > > > > >> once and it will run for days or weeks, while batch jobs
>>>>> >> >> > >> will be
>>>>> >> >> > >> > > > > > > submitted
>>>>> >> >> > >> > > > > > > >> more frequently compared with streaming jobs. This means
>>>>> >> >> > >> that
>>>>> >> >> > >> > > > maybe
>>>>> >> >> > >> > > > > > > session
>>>>> >> >> > >> > > > > > > >> mode also needs this feature. However, if we support this
>>>>> >> >> > >> > > feature
>>>>> >> >> > >> > > > in
>>>>> >> >> > >> > > > > > > >> session mode, the application master will become the new
>>>>> >> >> > >> > > > centralized
>>>>> >> >> > >> > > > > > > >> service(which should be solved). So in this case, it's
>>>>> >> >> > >> better to
>>>>> >> >> > >> > > > > have
>>>>> >> >> > >> > > > > > a
>>>>> >> >> > >> > > > > > > >> complete design for both per-job mode and session mode.
>>>>> >> >> > >> > > > Furthermore,
>>>>> >> >> > >> > > > > > > even
>>>>> >> >> > >> > > > > > > >> if we can do it phase by phase, we need to have a whole
>>>>> >> >> > >> picture
>>>>> >> >> > >> > > of
>>>>> >> >> > >> > > > > how
>>>>> >> >> > >> > > > > > > it
>>>>> >> >> > >> > > > > > > >> works in both per-job mode and session mode.
>>>>> >> >> > >> > > > > > > >>
>>>>> >> >> > >> > > > > > > >> 2) It's better to consider the convenience for users, such
>>>>> >> >> > >> as
>>>>> >> >> > >> > > > > > debugging
>>>>> >> >> > >> > > > > > > >> After we finish this feature, the job graph will be
>>>>> >> >> > >> compiled in
>>>>> >> >> > >> > > > the
>>>>> >> >> > >> > > > > > > >> application master, which means that users cannot easily
>>>>> >> >> > >> get the
>>>>> >> >> > >> > > > > > > exception
>>>>> >> >> > >> > > > > > > >> message synchorousely in the job client if there are
>>>>> >> >> > >> problems
>>>>> >> >> > >> > > > during
>>>>> >> >> > >> > > > > > the
>>>>> >> >> > >> > > > > > > >> job graph compiling (especially for platform users), such
>>>>> >> >> > >> as the
>>>>> >> >> > >> > > > > > > resource
>>>>> >> >> > >> > > > > > > >> path is incorrect, the user program itself has some
>>>>> >> >> > >> problems,
>>>>> >> >> > >> > > etc.
>>>>> >> >> > >> > > > > > What
>>>>> >> >> > >> > > > > > > I'm
>>>>> >> >> > >> > > > > > > >> thinking is that maybe we should throw the exceptions as
>>>>> >> >> > >> early
>>>>> >> >> > >> > > as
>>>>> >> >> > >> > > > > > > possible
>>>>> >> >> > >> > > > > > > >> (during job submission stage).
>>>>> >> >> > >> > > > > > > >>
>>>>> >> >> > >> > > > > > > >> 3) It's better to consider the impact to the stability of
>>>>> >> >> > >> the
>>>>> >> >> > >> > > > > cluster
>>>>> >> >> > >> > > > > > > >> If we perform the compiling in the application master, we
>>>>> >> >> > >> should
>>>>> >> >> > >> > > > > > > consider
>>>>> >> >> > >> > > > > > > >> the impact of the compiling errors. Although YARN could
>>>>> >> >> > >> resume
>>>>> >> >> > >> > > the
>>>>> >> >> > >> > > > > > > >> application master in case of failures, but in some case
>>>>> >> >> > >> the
>>>>> >> >> > >> > > > > compiling
>>>>> >> >> > >> > > > > > > >> failure may be a waste of cluster resource and may impact
>>>>> >> >> > >> the
>>>>> >> >> > >> > > > > > stability
>>>>> >> >> > >> > > > > > > the
>>>>> >> >> > >> > > > > > > >> cluster and the other jobs in the cluster, such as the
>>>>> >> >> > >> resource
>>>>> >> >> > >> > > > path
>>>>> >> >> > >> > > > > > is
>>>>> >> >> > >> > > > > > > >> incorrect, the user program itself has some problems(in
>>>>> >> >> > >> this
>>>>> >> >> > >> > > case,
>>>>> >> >> > >> > > > > job
>>>>> >> >> > >> > > > > > > >> failover cannot solve this kind of problems) etc. In the
>>>>> >> >> > >> current
>>>>> >> >> > >> > > > > > > >> implemention, the compiling errors are handled in the
>>>>> >> >> > >> client
>>>>> >> >> > >> > > side
>>>>> >> >> > >> > > > > and
>>>>> >> >> > >> > > > > > > there
>>>>> >> >> > >> > > > > > > >> is no impact to the cluster at all.
>>>>> >> >> > >> > > > > > > >>
>>>>> >> >> > >> > > > > > > >> Regarding to 1), it's clearly pointed in the design doc
>>>>> >> >> > >> that
>>>>> >> >> > >> > > only
>>>>> >> >> > >> > > > > > > per-job
>>>>> >> >> > >> > > > > > > >> mode will be supported. However, I think it's better to
>>>>> >> >> > >> also
>>>>> >> >> > >> > > > > consider
>>>>> >> >> > >> > > > > > > the
>>>>> >> >> > >> > > > > > > >> session mode in the design doc.
>>>>> >> >> > >> > > > > > > >> Regarding to 2) and 3), I have not seen related sections
>>>>> >> >> > >> in the
>>>>> >> >> > >> > > > > design
>>>>> >> >> > >> > > > > > > >> doc. It will be good if we can cover them in the design
>>>>> >> >> > >> doc.
>>>>> >> >> > >> > > > > > > >>
>>>>> >> >> > >> > > > > > > >> Feel free to correct me If there is anything I
>>>>> >> >> > >> misunderstand.
>>>>> >> >> > >> > > > > > > >>
>>>>> >> >> > >> > > > > > > >> Regards,
>>>>> >> >> > >> > > > > > > >> Dian
>>>>> >> >> > >> > > > > > > >>
>>>>> >> >> > >> > > > > > > >>
>>>>> >> >> > >> > > > > > > >> > 在 2019年12月27日,上午3:13,Peter Huang <
>>>>> >> >> > >> [hidden email]>
>>>>> >> >> > >> > > > 写道:
>>>>> >> >> > >> > > > > > > >> >
>>>>> >> >> > >> > > > > > > >> > Hi Yang,
>>>>> >> >> > >> > > > > > > >> >
>>>>> >> >> > >> > > > > > > >> > I can't agree more. The effort definitely needs to align
>>>>> >> >> > >> with
>>>>> >> >> > >> > > > the
>>>>> >> >> > >> > > > > > > final
>>>>> >> >> > >> > > > > > > >> > goal of FLIP-73.
>>>>> >> >> > >> > > > > > > >> > I am thinking about whether we can achieve the goal with
>>>>> >> >> > >> two
>>>>> >> >> > >> > > > > phases.
>>>>> >> >> > >> > > > > > > >> >
>>>>> >> >> > >> > > > > > > >> > 1) Phase I
>>>>> >> >> > >> > > > > > > >> > As the CLiFrontend will not be depreciated soon. We can
>>>>> >> >> > >> still
>>>>> >> >> > >> > > > use
>>>>> >> >> > >> > > > > > the
>>>>> >> >> > >> > > > > > > >> > deployMode flag there,
>>>>> >> >> > >> > > > > > > >> > pass the program info through Flink configuration,  use
>>>>> >> >> > >> the
>>>>> >> >> > >> > > > > > > >> > ClassPathJobGraphRetriever
>>>>> >> >> > >> > > > > > > >> > to generate the job graph in ClusterEntrypoints of yarn
>>>>> >> >> > >> and
>>>>> >> >> > >> > > > > > > Kubernetes.
>>>>> >> >> > >> > > > > > > >> >
>>>>> >> >> > >> > > > > > > >> > 2) Phase II
>>>>> >> >> > >> > > > > > > >> > In  AbstractJobClusterExecutor, the job graph is
>>>>> >> >> > >> generated in
>>>>> >> >> > >> > > > the
>>>>> >> >> > >> > > > > > > >> execute
>>>>> >> >> > >> > > > > > > >> > function. We can still
>>>>> >> >> > >> > > > > > > >> > use the deployMode in it. With deployMode = cluster, the
>>>>> >> >> > >> > > execute
>>>>> >> >> > >> > > > > > > >> function
>>>>> >> >> > >> > > > > > > >> > only starts the cluster.
>>>>> >> >> > >> > > > > > > >> >
>>>>> >> >> > >> > > > > > > >> > When {Yarn/Kuberneates}PerJobClusterEntrypoint starts,
>>>>> >> >> > >> It will
>>>>> >> >> > >> > > > > start
>>>>> >> >> > >> > > > > > > the
>>>>> >> >> > >> > > > > > > >> > dispatch first, then we can use
>>>>> >> >> > >> > > > > > > >> > a ClusterEnvironment similar to ContextEnvironment to
>>>>> >> >> > >> submit
>>>>> >> >> > >> > > the
>>>>> >> >> > >> > > > > job
>>>>> >> >> > >> > > > > > > >> with
>>>>> >> >> > >> > > > > > > >> > jobName the local
>>>>> >> >> > >> > > > > > > >> > dispatcher. For the details, we need more investigation.
>>>>> >> >> > >> Let's
>>>>> >> >> > >> > > > > wait
>>>>> >> >> > >> > > > > > > >> > for @Aljoscha
>>>>> >> >> > >> > > > > > > >> > Krettek <[hidden email]> @Till Rohrmann <
>>>>> >> >> > >> > > > > [hidden email]
>>>>> >> >> > >> > > > > > >'s
>>>>> >> >> > >> > > > > > > >> > feedback after the holiday season.
>>>>> >> >> > >> > > > > > > >> >
>>>>> >> >> > >> > > > > > > >> > Thank you in advance. Merry Chrismas and Happy New
>>>>> >> >> > >> Year!!!
>>>>> >> >> > >> > > > > > > >> >
>>>>> >> >> > >> > > > > > > >> >
>>>>> >> >> > >> > > > > > > >> > Best Regards
>>>>> >> >> > >> > > > > > > >> > Peter Huang
>>>>> >> >> > >> > > > > > > >> >
>>>>> >> >> > >> > > > > > > >> >
>>>>> >> >> > >> > > > > > > >> >
>>>>> >> >> > >> > > > > > > >> >
>>>>> >> >> > >> > > > > > > >> >
>>>>> >> >> > >> > > > > > > >> >
>>>>> >> >> > >> > > > > > > >> >
>>>>> >> >> > >> > > > > > > >> >
>>>>> >> >> > >> > > > > > > >> > On Wed, Dec 25, 2019 at 1:08 AM Yang Wang <
>>>>> >> >> > >> > > > [hidden email]>
>>>>> >> >> > >> > > > > > > >> wrote:
>>>>> >> >> > >> > > > > > > >> >
>>>>> >> >> > >> > > > > > > >> >> Hi Peter,
>>>>> >> >> > >> > > > > > > >> >>
>>>>> >> >> > >> > > > > > > >> >> I think we need to reconsider tison's suggestion
>>>>> >> >> > >> seriously.
>>>>> >> >> > >> > > > After
>>>>> >> >> > >> > > > > > > >> FLIP-73,
>>>>> >> >> > >> > > > > > > >> >> the deployJobCluster has
>>>>> >> >> > >> > > > > > > >> >> beenmoved into `JobClusterExecutor#execute`. It should
>>>>> >> >> > >> not be
>>>>> >> >> > >> > > > > > > perceived
>>>>> >> >> > >> > > > > > > >> >> for `CliFrontend`. That
>>>>> >> >> > >> > > > > > > >> >> means the user program will *ALWAYS* be executed on
>>>>> >> >> > >> client
>>>>> >> >> > >> > > > side.
>>>>> >> >> > >> > > > > > This
>>>>> >> >> > >> > > > > > > >> is
>>>>> >> >> > >> > > > > > > >> >> the by design behavior.
>>>>> >> >> > >> > > > > > > >> >> So, we could not just add `if(client mode) .. else
>>>>> >> >> > >> if(cluster
>>>>> >> >> > >> > > > > mode)
>>>>> >> >> > >> > > > > > > >> ...`
>>>>> >> >> > >> > > > > > > >> >> codes in `CliFrontend` to bypass
>>>>> >> >> > >> > > > > > > >> >> the executor. We need to find a clean way to decouple
>>>>> >> >> > >> > > executing
>>>>> >> >> > >> > > > > > user
>>>>> >> >> > >> > > > > > > >> >> program and deploying per-job
>>>>> >> >> > >> > > > > > > >> >> cluster. Based on this, we could support to execute user
>>>>> >> >> > >> > > > program
>>>>> >> >> > >> > > > > on
>>>>> >> >> > >> > > > > > > >> client
>>>>> >> >> > >> > > > > > > >> >> or master side.
>>>>> >> >> > >> > > > > > > >> >>
>>>>> >> >> > >> > > > > > > >> >> Maybe Aljoscha and Jeff could give some good
>>>>> >> >> > >> suggestions.
>>>>> >> >> > >> > > > > > > >> >>
>>>>> >> >> > >> > > > > > > >> >>
>>>>> >> >> > >> > > > > > > >> >>
>>>>> >> >> > >> > > > > > > >> >> Best,
>>>>> >> >> > >> > > > > > > >> >> Yang
>>>>> >> >> > >> > > > > > > >> >>
>>>>> >> >> > >> > > > > > > >> >> Peter Huang <[hidden email]> 于2019年12月25日周三
>>>>> >> >> > >> > > > > 上午4:03写道:
>>>>> >> >> > >> > > > > > > >> >>
>>>>> >> >> > >> > > > > > > >> >>> Hi Jingjing,
>>>>> >> >> > >> > > > > > > >> >>>
>>>>> >> >> > >> > > > > > > >> >>> The improvement proposed is a deployment option for
>>>>> >> >> > >> CLI. For
>>>>> >> >> > >> > > > SQL
>>>>> >> >> > >> > > > > > > based
>>>>> >> >> > >> > > > > > > >> >>> Flink application, It is more convenient to use the
>>>>> >> >> > >> existing
>>>>> >> >> > >> > > > > model
>>>>> >> >> > >> > > > > > > in
>>>>> >> >> > >> > > > > > > >> >>> SqlClient in which
>>>>> >> >> > >> > > > > > > >> >>> the job graph is generated within SqlClient. After
>>>>> >> >> > >> adding
>>>>> >> >> > >> > > the
>>>>> >> >> > >> > > > > > > delayed
>>>>> >> >> > >> > > > > > > >> job
>>>>> >> >> > >> > > > > > > >> >>> graph generation, I think there is no change is needed
>>>>> >> >> > >> for
>>>>> >> >> > >> > > > your
>>>>> >> >> > >> > > > > > > side.
>>>>> >> >> > >> > > > > > > >> >>>
>>>>> >> >> > >> > > > > > > >> >>>
>>>>> >> >> > >> > > > > > > >> >>> Best Regards
>>>>> >> >> > >> > > > > > > >> >>> Peter Huang
>>>>> >> >> > >> > > > > > > >> >>>
>>>>> >> >> > >> > > > > > > >> >>>
>>>>> >> >> > >> > > > > > > >> >>> On Wed, Dec 18, 2019 at 6:01 AM jingjing bai <
>>>>> >> >> > >> > > > > > > >> [hidden email]>
>>>>> >> >> > >> > > > > > > >> >>> wrote:
>>>>> >> >> > >> > > > > > > >> >>>
>>>>> >> >> > >> > > > > > > >> >>>> hi peter:
>>>>> >> >> > >> > > > > > > >> >>>>    we had extension SqlClent to support sql job
>>>>> >> >> > >> submit in
>>>>> >> >> > >> > > web
>>>>> >> >> > >> > > > > > base
>>>>> >> >> > >> > > > > > > on
>>>>> >> >> > >> > > > > > > >> >>>> flink 1.9.   we support submit to yarn on per job
>>>>> >> >> > >> mode too.
>>>>> >> >> > >> > > > > > > >> >>>>    in this case, the job graph generated  on client
>>>>> >> >> > >> side
>>>>> >> >> > >> > > .  I
>>>>> >> >> > >> > > > > > think
>>>>> >> >> > >> > > > > > > >> >>> this
>>>>> >> >> > >> > > > > > > >> >>>> discuss Mainly to improve api programme.  but in my
>>>>> >> >> > >> case ,
>>>>> >> >> > >> > > > > there
>>>>> >> >> > >> > > > > > is
>>>>> >> >> > >> > > > > > > >> no
>>>>> >> >> > >> > > > > > > >> >>>> jar to upload but only a sql string .
>>>>> >> >> > >> > > > > > > >> >>>>    do u had more suggestion to improve for sql mode
>>>>> >> >> > >> or it
>>>>> >> >> > >> > > is
>>>>> >> >> > >> > > > > > only a
>>>>> >> >> > >> > > > > > > >> >>>> switch for api programme?
>>>>> >> >> > >> > > > > > > >> >>>>
>>>>> >> >> > >> > > > > > > >> >>>>
>>>>> >> >> > >> > > > > > > >> >>>> best
>>>>> >> >> > >> > > > > > > >> >>>> bai jj
>>>>> >> >> > >> > > > > > > >> >>>>
>>>>> >> >> > >> > > > > > > >> >>>>
>>>>> >> >> > >> > > > > > > >> >>>> Yang Wang <[hidden email]> 于2019年12月18日周三
>>>>> >> >> > >> 下午7:21写道:
>>>>> >> >> > >> > > > > > > >> >>>>
>>>>> >> >> > >> > > > > > > >> >>>>> I just want to revive this discussion.
>>>>> >> >> > >> > > > > > > >> >>>>>
>>>>> >> >> > >> > > > > > > >> >>>>> Recently, i am thinking about how to natively run
>>>>> >> >> > >> flink
>>>>> >> >> > >> > > > > per-job
>>>>> >> >> > >> > > > > > > >> >>> cluster on
>>>>> >> >> > >> > > > > > > >> >>>>> Kubernetes.
>>>>> >> >> > >> > > > > > > >> >>>>> The per-job mode on Kubernetes is very different
>>>>> >> >> > >> from on
>>>>> >> >> > >> > > > Yarn.
>>>>> >> >> > >> > > > > > And
>>>>> >> >> > >> > > > > > > >> we
>>>>> >> >> > >> > > > > > > >> >>> will
>>>>> >> >> > >> > > > > > > >> >>>>> have
>>>>> >> >> > >> > > > > > > >> >>>>> the same deployment requirements to the client and
>>>>> >> >> > >> entry
>>>>> >> >> > >> > > > > point.
>>>>> >> >> > >> > > > > > > >> >>>>>
>>>>> >> >> > >> > > > > > > >> >>>>> 1. Flink client not always need a local jar to start
>>>>> >> >> > >> a
>>>>> >> >> > >> > > Flink
>>>>> >> >> > >> > > > > > > per-job
>>>>> >> >> > >> > > > > > > >> >>>>> cluster. We could
>>>>> >> >> > >> > > > > > > >> >>>>> support multiple schemas. For example,
>>>>> >> >> > >> > > > file:///path/of/my.jar
>>>>> >> >> > >> > > > > > > means
>>>>> >> >> > >> > > > > > > >> a
>>>>> >> >> > >> > > > > > > >> >>> jar
>>>>> >> >> > >> > > > > > > >> >>>>> located
>>>>> >> >> > >> > > > > > > >> >>>>> at client side,
>>>>> >> >> > >> hdfs://myhdfs/user/myname/flink/my.jar
>>>>> >> >> > >> > > > means a
>>>>> >> >> > >> > > > > > jar
>>>>> >> >> > >> > > > > > > >> >>> located
>>>>> >> >> > >> > > > > > > >> >>>>> at
>>>>> >> >> > >> > > > > > > >> >>>>> remote hdfs, local:///path/in/image/my.jar means a
>>>>> >> >> > >> jar
>>>>> >> >> > >> > > > located
>>>>> >> >> > >> > > > > > at
>>>>> >> >> > >> > > > > > > >> >>>>> jobmanager side.
>>>>> >> >> > >> > > > > > > >> >>>>>
>>>>> >> >> > >> > > > > > > >> >>>>> 2. Support running user program on master side. This
>>>>> >> >> > >> also
>>>>> >> >> > >> > > > > means
>>>>> >> >> > >> > > > > > > the
>>>>> >> >> > >> > > > > > > >> >>> entry
>>>>> >> >> > >> > > > > > > >> >>>>> point
>>>>> >> >> > >> > > > > > > >> >>>>> will generate the job graph on master side. We could
>>>>> >> >> > >> use
>>>>> >> >> > >> > > the
>>>>> >> >> > >> > > > > > > >> >>>>> ClasspathJobGraphRetriever
>>>>> >> >> > >> > > > > > > >> >>>>> or start a local Flink client to achieve this
>>>>> >> >> > >> purpose.
>>>>> >> >> > >> > > > > > > >> >>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>
>>>>> >> >> > >> > > > > > > >> >>>>> cc tison, Aljoscha & Kostas Do you think this is the
>>>>> >> >> > >> right
>>>>> >> >> > >> > > > > > > >> direction we
>>>>> >> >> > >> > > > > > > >> >>>>> need to work?
>>>>> >> >> > >> > > > > > > >> >>>>>
>>>>> >> >> > >> > > > > > > >> >>>>> tison <[hidden email]> 于2019年12月12日周四
>>>>> >> >> > >> 下午4:48写道:
>>>>> >> >> > >> > > > > > > >> >>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>> A quick idea is that we separate the deployment
>>>>> >> >> > >> from user
>>>>> >> >> > >> > > > > > program
>>>>> >> >> > >> > > > > > > >> >>> that
>>>>> >> >> > >> > > > > > > >> >>>>> it
>>>>> >> >> > >> > > > > > > >> >>>>>> has always been done
>>>>> >> >> > >> > > > > > > >> >>>>>> outside the program. On user program executed there
>>>>> >> >> > >> is
>>>>> >> >> > >> > > > > always a
>>>>> >> >> > >> > > > > > > >> >>>>>> ClusterClient that communicates with
>>>>> >> >> > >> > > > > > > >> >>>>>> an existing cluster, remote or local. It will be
>>>>> >> >> > >> another
>>>>> >> >> > >> > > > > thread
>>>>> >> >> > >> > > > > > > so
>>>>> >> >> > >> > > > > > > >> >>> just
>>>>> >> >> > >> > > > > > > >> >>>>> for
>>>>> >> >> > >> > > > > > > >> >>>>>> your information.
>>>>> >> >> > >> > > > > > > >> >>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>> Best,
>>>>> >> >> > >> > > > > > > >> >>>>>> tison.
>>>>> >> >> > >> > > > > > > >> >>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>> tison <[hidden email]> 于2019年12月12日周四
>>>>> >> >> > >> 下午4:40写道:
>>>>> >> >> > >> > > > > > > >> >>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>> Hi Peter,
>>>>> >> >> > >> > > > > > > >> >>>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>> Another concern I realized recently is that with
>>>>> >> >> > >> current
>>>>> >> >> > >> > > > > > > Executors
>>>>> >> >> > >> > > > > > > >> >>>>>>> abstraction(FLIP-73)
>>>>> >> >> > >> > > > > > > >> >>>>>>> I'm afraid that user program is designed to ALWAYS
>>>>> >> >> > >> run
>>>>> >> >> > >> > > on
>>>>> >> >> > >> > > > > the
>>>>> >> >> > >> > > > > > > >> >>> client
>>>>> >> >> > >> > > > > > > >> >>>>>> side.
>>>>> >> >> > >> > > > > > > >> >>>>>>> Specifically,
>>>>> >> >> > >> > > > > > > >> >>>>>>> we deploy the job in executor when env.execute
>>>>> >> >> > >> called.
>>>>> >> >> > >> > > > This
>>>>> >> >> > >> > > > > > > >> >>>>> abstraction
>>>>> >> >> > >> > > > > > > >> >>>>>>> possibly prevents
>>>>> >> >> > >> > > > > > > >> >>>>>>> Flink runs user program on the cluster side.
>>>>> >> >> > >> > > > > > > >> >>>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>> For your proposal, in this case we already
>>>>> >> >> > >> compiled the
>>>>> >> >> > >> > > > > > program
>>>>> >> >> > >> > > > > > > >> and
>>>>> >> >> > >> > > > > > > >> >>>>> run
>>>>> >> >> > >> > > > > > > >> >>>>>> on
>>>>> >> >> > >> > > > > > > >> >>>>>>> the client side,
>>>>> >> >> > >> > > > > > > >> >>>>>>> even we deploy a cluster and retrieve job graph
>>>>> >> >> > >> from
>>>>> >> >> > >> > > > program
>>>>> >> >> > >> > > > > > > >> >>>>> metadata, it
>>>>> >> >> > >> > > > > > > >> >>>>>>> doesn't make
>>>>> >> >> > >> > > > > > > >> >>>>>>> many sense.
>>>>> >> >> > >> > > > > > > >> >>>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>> cc Aljoscha & Kostas what do you think about this
>>>>> >> >> > >> > > > > constraint?
>>>>> >> >> > >> > > > > > > >> >>>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>> Best,
>>>>> >> >> > >> > > > > > > >> >>>>>>> tison.
>>>>> >> >> > >> > > > > > > >> >>>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>> Peter Huang <[hidden email]>
>>>>> >> >> > >> 于2019年12月10日周二
>>>>> >> >> > >> > > > > > > >> 下午12:45写道:
>>>>> >> >> > >> > > > > > > >> >>>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>>> Hi Tison,
>>>>> >> >> > >> > > > > > > >> >>>>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>>> Yes, you are right. I think I made the wrong
>>>>> >> >> > >> argument
>>>>> >> >> > >> > > in
>>>>> >> >> > >> > > > > the
>>>>> >> >> > >> > > > > > > doc.
>>>>> >> >> > >> > > > > > > >> >>>>>>>> Basically, the packaging jar problem is only for
>>>>> >> >> > >> > > platform
>>>>> >> >> > >> > > > > > > users.
>>>>> >> >> > >> > > > > > > >> >>> In
>>>>> >> >> > >> > > > > > > >> >>>>> our
>>>>> >> >> > >> > > > > > > >> >>>>>>>> internal deploy service,
>>>>> >> >> > >> > > > > > > >> >>>>>>>> we further optimized the deployment latency by
>>>>> >> >> > >> letting
>>>>> >> >> > >> > > > > users
>>>>> >> >> > >> > > > > > to
>>>>> >> >> > >> > > > > > > >> >>>>>> packaging
>>>>> >> >> > >> > > > > > > >> >>>>>>>> flink-runtime together with the uber jar, so that
>>>>> >> >> > >> we
>>>>> >> >> > >> > > > don't
>>>>> >> >> > >> > > > > > need
>>>>> >> >> > >> > > > > > > >> to
>>>>> >> >> > >> > > > > > > >> >>>>>>>> consider
>>>>> >> >> > >> > > > > > > >> >>>>>>>> multiple flink version
>>>>> >> >> > >> > > > > > > >> >>>>>>>> support for now. In the session client mode, as
>>>>> >> >> > >> Flink
>>>>> >> >> > >> > > > libs
>>>>> >> >> > >> > > > > > will
>>>>> >> >> > >> > > > > > > >> be
>>>>> >> >> > >> > > > > > > >> >>>>>> shipped
>>>>> >> >> > >> > > > > > > >> >>>>>>>> anyway as local resources of yarn. Users actually
>>>>> >> >> > >> don't
>>>>> >> >> > >> > > > > need
>>>>> >> >> > >> > > > > > to
>>>>> >> >> > >> > > > > > > >> >>>>> package
>>>>> >> >> > >> > > > > > > >> >>>>>>>> those libs into job jar.
>>>>> >> >> > >> > > > > > > >> >>>>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>>> Best Regards
>>>>> >> >> > >> > > > > > > >> >>>>>>>> Peter Huang
>>>>> >> >> > >> > > > > > > >> >>>>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>>> On Mon, Dec 9, 2019 at 8:35 PM tison <
>>>>> >> >> > >> > > > [hidden email]
>>>>> >> >> > >> > > > > >
>>>>> >> >> > >> > > > > > > >> >>> wrote:
>>>>> >> >> > >> > > > > > > >> >>>>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>> 3. What do you mean about the package? Do users
>>>>> >> >> > >> need
>>>>> >> >> > >> > > to
>>>>> >> >> > >> > > > > > > >> >>> compile
>>>>> >> >> > >> > > > > > > >> >>>>>> their
>>>>> >> >> > >> > > > > > > >> >>>>>>>>> jars
>>>>> >> >> > >> > > > > > > >> >>>>>>>>> inlcuding flink-clients, flink-optimizer,
>>>>> >> >> > >> flink-table
>>>>> >> >> > >> > > > > codes?
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>>>> The answer should be no because they exist in
>>>>> >> >> > >> system
>>>>> >> >> > >> > > > > > > classpath.
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>>>> Best,
>>>>> >> >> > >> > > > > > > >> >>>>>>>>> tison.
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>>>> Yang Wang <[hidden email]> 于2019年12月10日周二
>>>>> >> >> > >> > > > > 下午12:18写道:
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>> Hi Peter,
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>> Thanks a lot for starting this discussion. I
>>>>> >> >> > >> think
>>>>> >> >> > >> > > this
>>>>> >> >> > >> > > > > is
>>>>> >> >> > >> > > > > > a
>>>>> >> >> > >> > > > > > > >> >>> very
>>>>> >> >> > >> > > > > > > >> >>>>>>>> useful
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>> feature.
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>> Not only for Yarn, i am focused on flink on
>>>>> >> >> > >> > > Kubernetes
>>>>> >> >> > >> > > > > > > >> >>>>> integration
>>>>> >> >> > >> > > > > > > >> >>>>>> and
>>>>> >> >> > >> > > > > > > >> >>>>>>>>> come
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>> across the same
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>> problem. I do not want the job graph generated
>>>>> >> >> > >> on
>>>>> >> >> > >> > > > client
>>>>> >> >> > >> > > > > > > side.
>>>>> >> >> > >> > > > > > > >> >>>>>>>> Instead,
>>>>> >> >> > >> > > > > > > >> >>>>>>>>> the
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>> user jars are built in
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>> a user-defined image. When the job manager
>>>>> >> >> > >> launched,
>>>>> >> >> > >> > > we
>>>>> >> >> > >> > > > > > just
>>>>> >> >> > >> > > > > > > >> >>>>> need to
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>> generate the job graph
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>> based on local user jars.
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>> I have some small suggestion about this.
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>> 1. `ProgramJobGraphRetriever` is very similar to
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>> `ClasspathJobGraphRetriever`, the differences
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>> are the former needs `ProgramMetadata` and the
>>>>> >> >> > >> latter
>>>>> >> >> > >> > > > > needs
>>>>> >> >> > >> > > > > > > >> >>> some
>>>>> >> >> > >> > > > > > > >> >>>>>>>>> arguments.
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>> Is it possible to
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>> have an unified `JobGraphRetriever` to support
>>>>> >> >> > >> both?
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>> 2. Is it possible to not use a local user jar to
>>>>> >> >> > >> > > start
>>>>> >> >> > >> > > > a
>>>>> >> >> > >> > > > > > > >> >>> per-job
>>>>> >> >> > >> > > > > > > >> >>>>>>>> cluster?
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>> In your case, the user jars has
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>> existed on hdfs already and we do need to
>>>>> >> >> > >> download
>>>>> >> >> > >> > > the
>>>>> >> >> > >> > > > > jars
>>>>> >> >> > >> > > > > > > to
>>>>> >> >> > >> > > > > > > >> >>>>>>>> deployer
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>> service. Currently, we
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>> always need a local user jar to start a flink
>>>>> >> >> > >> > > cluster.
>>>>> >> >> > >> > > > It
>>>>> >> >> > >> > > > > > is
>>>>> >> >> > >> > > > > > > >> >>> be
>>>>> >> >> > >> > > > > > > >> >>>>>> great
>>>>> >> >> > >> > > > > > > >> >>>>>>>> if
>>>>> >> >> > >> > > > > > > >> >>>>>>>>> we
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>> could support remote user jars.
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>>>> In the implementation, we assume users package
>>>>> >> >> > >> > > > > > > >> >>> flink-clients,
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>> flink-optimizer, flink-table together within
>>>>> >> >> > >> the job
>>>>> >> >> > >> > > > jar.
>>>>> >> >> > >> > > > > > > >> >>>>> Otherwise,
>>>>> >> >> > >> > > > > > > >> >>>>>>>> the
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>> job graph generation within
>>>>> >> >> > >> JobClusterEntryPoint will
>>>>> >> >> > >> > > > > fail.
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>> 3. What do you mean about the package? Do users
>>>>> >> >> > >> need
>>>>> >> >> > >> > > to
>>>>> >> >> > >> > > > > > > >> >>> compile
>>>>> >> >> > >> > > > > > > >> >>>>>> their
>>>>> >> >> > >> > > > > > > >> >>>>>>>>> jars
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>> inlcuding flink-clients, flink-optimizer,
>>>>> >> >> > >> flink-table
>>>>> >> >> > >> > > > > > codes?
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>> Best,
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>> Yang
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>> Peter Huang <[hidden email]>
>>>>> >> >> > >> > > > 于2019年12月10日周二
>>>>> >> >> > >> > > > > > > >> >>>>> 上午2:37写道:
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> Dear All,
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> Recently, the Flink community starts to
>>>>> >> >> > >> improve the
>>>>> >> >> > >> > > > yarn
>>>>> >> >> > >> > > > > > > >> >>>>> cluster
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>> descriptor
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> to make job jar and config files configurable
>>>>> >> >> > >> from
>>>>> >> >> > >> > > > CLI.
>>>>> >> >> > >> > > > > It
>>>>> >> >> > >> > > > > > > >> >>>>>> improves
>>>>> >> >> > >> > > > > > > >> >>>>>>>> the
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> flexibility of  Flink deployment Yarn Per Job
>>>>> >> >> > >> Mode.
>>>>> >> >> > >> > > > For
>>>>> >> >> > >> > > > > > > >> >>>>> platform
>>>>> >> >> > >> > > > > > > >> >>>>>>>> users
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>> who
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> manage tens of hundreds of streaming pipelines
>>>>> >> >> > >> for
>>>>> >> >> > >> > > the
>>>>> >> >> > >> > > > > > whole
>>>>> >> >> > >> > > > > > > >> >>>>> org
>>>>> >> >> > >> > > > > > > >> >>>>>> or
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> company, we found the job graph generation in
>>>>> >> >> > >> > > > > client-side
>>>>> >> >> > >> > > > > > is
>>>>> >> >> > >> > > > > > > >> >>>>>> another
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> pinpoint. Thus, we want to propose a
>>>>> >> >> > >> configurable
>>>>> >> >> > >> > > > > feature
>>>>> >> >> > >> > > > > > > >> >>> for
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> FlinkYarnSessionCli. The feature can allow
>>>>> >> >> > >> users to
>>>>> >> >> > >> > > > > choose
>>>>> >> >> > >> > > > > > > >> >>> the
>>>>> >> >> > >> > > > > > > >> >>>>> job
>>>>> >> >> > >> > > > > > > >> >>>>>>>>> graph
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> generation in Flink ClusterEntryPoint so that
>>>>> >> >> > >> the
>>>>> >> >> > >> > > job
>>>>> >> >> > >> > > > > jar
>>>>> >> >> > >> > > > > > > >> >>>>> doesn't
>>>>> >> >> > >> > > > > > > >> >>>>>>>> need
>>>>> >> >> > >> > > > > > > >> >>>>>>>>> to
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> be locally for the job graph generation. The
>>>>> >> >> > >> > > proposal
>>>>> >> >> > >> > > > is
>>>>> >> >> > >> > > > > > > >> >>>>> organized
>>>>> >> >> > >> > > > > > > >> >>>>>>>> as a
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> FLIP
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>
>>>>> >> >> > >> > > > > > > >> >>>
>>>>> >> >> > >> > > > > > > >>
>>>>> >> >> > >> > > > > > >
>>>>> >> >> > >> > > > > >
>>>>> >> >> > >> > > > >
>>>>> >> >> > >> > > >
>>>>> >> >> > >> > >
>>>>> >> >> > >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-85+Delayed+JobGraph+Generation
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> .
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> Any questions and suggestions are welcomed.
>>>>> >> >> > >> Thank
>>>>> >> >> > >> > > you
>>>>> >> >> > >> > > > in
>>>>> >> >> > >> > > > > > > >> >>>>> advance.
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> Best Regards
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>>> Peter Huang
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>>
>>>>> >> >> > >> > > > > > > >> >>>>>
>>>>> >> >> > >> > > > > > > >> >>>>
>>>>> >> >> > >> > > > > > > >> >>>
>>>>> >> >> > >> > > > > > > >> >>
>>>>> >> >> > >> > > > > > > >>
>>>>> >> >> > >> > > > > > > >>
>>>>> >> >> > >> > > > > > >
>>>>> >> >> > >> > > > > >
>>>>> >> >> > >> > > > >
>>>>> >> >> > >> > > >
>>>>> >> >> > >> > >
>>>>> >> >> > >>
>>>>> >> >> > >
>>>>> >> >>
123