Hi All!
With the growing number of Flink streaming applications, the current HS implementation is starting to lose its value. Users running streaming applications mostly care about what is running on the cluster right now, and a centralised view on history is not very useful.

We have been experimenting with reworking the current HS into a Global Flink Dashboard that would show all running and completed/failed jobs on all the running Flink clusters the users have.

In essence we would get a view similar to the current HS, but it would also show the running jobs with a link redirecting to the actual cluster-specific dashboard.

This is how it looks now:

In this version we took a very simple approach of introducing a cluster discovery abstraction to collect all the running Flink clusters (by listing YARN apps, for instance).

The main pages aggregating jobs from different clusters would then simply make calls to all clusters and aggregate the responses. Job-specific endpoints would simply be routed to the correct target cluster. This way the changes required are localised to the current HS implementation, and cluster REST endpoints don't need to be changed.

In addition to getting a fully working global dashboard, this also gets us a fully functioning REST endpoint for accessing all jobs in all clusters without having to provide the clusterId (YARN app id, for instance), which we can use to enhance the CLI experience in multi-cluster (lots of per-job clusters) environments.

Please let us know what you think!

Gyula
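To make the idea a bit more concrete, here is a minimal sketch (not the actual implementation) of what the cluster discovery abstraction and the aggregation step could look like. The ClusterDiscovery interface and GlobalJobOverview class are hypothetical names for this illustration; the sketch only assumes Java 11's HttpClient and Flink's standard GET /jobs/overview REST endpoint:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;
    import java.util.ArrayList;
    import java.util.List;

    /** Hypothetical cluster discovery abstraction: returns the REST base URLs
     *  of all running Flink clusters (e.g. by listing YARN applications). */
    interface ClusterDiscovery {
        List<String> discoverClusterRestUrls() throws Exception;
    }

    /** Sketch of the aggregation step: query every discovered cluster's
     *  /jobs/overview endpoint and collect the raw responses. A real
     *  implementation would merge the JSON job lists and tag each job with
     *  its cluster id so that job-specific requests can be routed back to
     *  the right cluster. */
    public class GlobalJobOverview {

        private final HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(5))
                .build();

        public List<String> fetchAllJobOverviews(ClusterDiscovery discovery) throws Exception {
            List<String> responses = new ArrayList<>();
            for (String baseUrl : discovery.discoverClusterRestUrls()) {
                HttpRequest request = HttpRequest.newBuilder()
                        .uri(URI.create(baseUrl + "/jobs/overview"))
                        .GET()
                        .build();
                // Each cluster keeps its own REST endpoint unchanged; the
                // dashboard only aggregates the per-cluster responses.
                HttpResponse<String> response =
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                responses.add(response.body());
            }
            return responses;
        }
    }

The important design point is that all the logic lives on the dashboard side: the individual clusters are only queried through their existing REST API.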
Hi Gyula,
Big +1 for this; it would be very helpful for Flink job and cluster operations. Do you call the Flink REST API to gather the job info? I hope this history server could work with multiple versions of Flink as long as the Flink REST API is compatible.

Best Regards,
Jeff Zhang
Oops I forgot the screenshot, thanks Ufuk :D
@Jeff Zhang <[hidden email]>: Yes, we simply call the individual clusters' REST endpoints, so it would work with multiple Flink versions.

Gyula
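For illustration, the clusterId-free access mentioned in the first mail could be sketched along the same lines: probe each discovered cluster's standard GET /jobs/<jobid> endpoint until one of them knows the job, then route job-specific requests there. The JobLocator class and its method name are hypothetical; only Java 11's HttpClient and the existing per-cluster REST API are assumed:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.List;
    import java.util.Optional;

    /** Hypothetical helper: find which cluster owns a given JobID so that a
     *  CLI or the job-specific dashboard pages can work without the user
     *  supplying a clusterId / YARN application id. */
    public class JobLocator {

        private final HttpClient client = HttpClient.newHttpClient();

        public Optional<String> findClusterForJob(List<String> clusterRestUrls, String jobId)
                throws Exception {
            for (String baseUrl : clusterRestUrls) {
                HttpRequest request = HttpRequest.newBuilder()
                        .uri(URI.create(baseUrl + "/jobs/" + jobId))
                        .GET()
                        .build();
                HttpResponse<String> response =
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                // A 200 response means this cluster knows the job; job-specific
                // REST calls can then be proxied to this base URL.
                if (response.statusCode() == 200) {
                    return Optional.of(baseUrl);
                }
            }
            return Optional.empty();
        }
    }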
It seems that not everyone can see the screenshot in the email, so here is
a link: https://drive.google.com/open?id=1abrlpI976NFqOZSX20k2FoiAfVhBbER9
Hi Gyula,
thanks for proposing this extension. I can see that such a feature could be helpful.

However, I wouldn't consider the management of multiple clusters core to Flink. Managing a single cluster is already complex enough, and given the available community capacity I would rather concentrate on doing this aspect right instead of adding more complexity and more code to maintain.

Maybe we could add this feature as a Flink package instead. That way it would still be available to our users. If it gains enough traction then we can also add it to Flink later. What do you think?

Cheers,
Till
Hi Till!
I agree to some extent that managing multiple clusters is not Flink's primary responsibility.

However, many (if not most) production users run Flink in per-job-cluster mode, which gives better configurability and resource isolation than the standalone/session modes. Still, the best job management experience is on standalone clusters, where users see all the jobs and can interact with them purely through their unique job ids.

This is the mismatch we were trying to resolve here, to get the best of both worlds. This of course only concerns production users running many different jobs, so we can definitely call it an enterprise feature.

I agree that this would be new code to maintain, in contrast to the current history server which "just works".

We are completely okay with not adding this to Flink just yet, as it will be part of the next Cloudera Flink release anyway. We will test run it there, gather production feedback for the Flink community, and we can make a better decision afterwards when we see the real value.

Cheers,
Gyula
This sounds like a good plan to me, Gyula. And there is always the Flink packages option if we want to make it available earlier.

Cheers,
Till