[DISCUSS] Rework History Server into Global Dashboard

[DISCUSS] Rework History Server into Global Dashboard

Gyula Fóra
Hi All!

With the growing number of Flink streaming applications, the current History
Server (HS) implementation is starting to lose its value. Users running
streaming applications mostly care about what is running on the cluster right
now, and a centralised view of history is not very useful.

We have been experimenting with reworking the current HS into a Global
Flink Dashboard that would show all running and completed/failed jobs on
all the running Flink clusters the users have.

In essence we would get a view similar to the current HS, but it would also
show the running jobs, with a link redirecting to the actual cluster-specific
dashboard.

This is how it looks now:

[screenshot missing from the original email; a link is shared later in the thread]
In this version we took a very simple approach of introducing a cluster
discovery abstraction to collect all the running Flink clusters (by listing
YARN apps, for instance).
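
To make the idea concrete, here is a rough sketch of what such a discovery
abstraction could look like (all class and method names here are made up for
illustration; this is not the actual implementation). The YARN variant just
lists running applications of type "Apache Flink" and takes their tracking URL
as the cluster's REST endpoint:

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.hadoop.yarn.api.records.YarnApplicationState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

/** A discovered Flink cluster: its id and the base URL of its REST endpoint. */
class FlinkClusterInfo {
    final String clusterId;
    final String restUrl;

    FlinkClusterInfo(String clusterId, String restUrl) {
        this.clusterId = clusterId;
        this.restUrl = restUrl;
    }
}

/** Pluggable discovery of all running Flink clusters (hypothetical interface). */
interface ClusterDiscovery {
    List<FlinkClusterInfo> discoverClusters() throws Exception;
}

/** YARN-based discovery: list running YARN applications of type "Apache Flink". */
class YarnClusterDiscovery implements ClusterDiscovery {
    @Override
    public List<FlinkClusterInfo> discoverClusters() throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());
        yarn.start();
        try {
            return yarn
                    .getApplications(
                            Collections.singleton("Apache Flink"),
                            EnumSet.of(YarnApplicationState.RUNNING))
                    .stream()
                    .map(app -> new FlinkClusterInfo(
                            app.getApplicationId().toString(),
                            // YARN's tracking URL proxies to the Flink web/REST endpoint
                            app.getTrackingUrl()))
                    .collect(Collectors.toList());
        } finally {
            yarn.stop();
        }
    }
}
```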

The main pages aggregating jobs from different clusters would then simply
make calls to all clusters and aggregate the responses. Job-specific
endpoints would simply be routed to the correct target cluster. This way
the changes required are localised to the current HS implementation, and the
cluster REST endpoints don't need to be changed.
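
As a sketch of the fan-out, the aggregator could simply issue the standard
GET /jobs/overview call (a real endpoint of the Flink monitoring REST API)
against every discovered cluster in parallel; everything else below is
hypothetical naming, not our actual code:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

/** Fans a REST call out to every discovered cluster and collects the responses. */
class MultiClusterJobFetcher {
    private final HttpClient http = HttpClient.newHttpClient();

    /**
     * Calls GET {restUrl}/jobs/overview on every cluster in parallel and returns
     * the raw JSON responses keyed by clusterId; the dashboard would then merge
     * the "jobs" arrays into one global job list tagged with the owning cluster.
     */
    Map<String, String> fetchJobOverviews(Map<String, String> restUrlByClusterId) {
        Map<String, CompletableFuture<String>> pending = restUrlByClusterId.entrySet().stream()
                .collect(Collectors.toMap(
                        Map.Entry::getKey,
                        e -> http.sendAsync(
                                        HttpRequest.newBuilder(URI.create(e.getValue() + "/jobs/overview"))
                                                .GET()
                                                .build(),
                                        HttpResponse.BodyHandlers.ofString())
                                .thenApply(HttpResponse::body)));
        // Blocks until all clusters have answered; error handling omitted in this sketch.
        return pending.entrySet().stream()
                .collect(Collectors.toMap(Map.Entry::getKey, e -> e.getValue().join()));
    }
}
```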

In addition to getting a fully working global dashboard, this also gets us a
fully functioning REST endpoint for accessing all jobs in all clusters
without having to provide the clusterId (the YARN app id, for instance),
which we can use to enhance the CLI experience in multi-cluster environments
(lots of per-job clusters); a rough sketch of this jobId-based routing follows
below.

Please let us know what you think!

Gyula
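
As promised above, here is a hypothetical sketch of that routing (again, all
names are illustrative only): the dashboard keeps a jobId-to-cluster mapping,
refreshed from the aggregated job overviews, and forwards any job-specific
request to the cluster that owns the job, so callers never need the clusterId:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Routes job-specific REST calls to the cluster that owns the job. */
class JobRequestRouter {
    // jobId -> cluster REST base URL, refreshed whenever job overviews are re-aggregated
    private final Map<String, String> clusterByJobId = new ConcurrentHashMap<>();

    void update(String jobId, String clusterRestUrl) {
        clusterByJobId.put(jobId, clusterRestUrl);
    }

    /** E.g. resolve("a1b2c3", "/jobs/a1b2c3/checkpoints") -> "http://host:8081/jobs/a1b2c3/checkpoints". */
    String resolve(String jobId, String path) {
        String base = clusterByJobId.get(jobId);
        if (base == null) {
            throw new IllegalArgumentException("Unknown jobId: " + jobId);
        }
        return base + path;
    }
}
```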

Re: [DISCUSS] Rework History Server into Global Dashboard

Jeff Zhang
Hi Gyula,

Big +1 for this; it would be very helpful for Flink job and cluster
operations. Do you call the Flink REST API to gather the job info? I hope this
history server could work with multiple versions of Flink, as long as the
Flink REST API is compatible.

--
Best Regards

Jeff Zhang

Re: [DISCUSS] Rework History Server into Global Dashboard

Gyula Fóra
Oops I forgot the screenshot, thanks Ufuk :D


@Jeff Zhang <[hidden email]>: Yes, we simply call the individual clusters'
REST endpoints, so it would work with multiple Flink versions.

Gyula


Re: [DISCUSS] Rework History Server into Global Dashboard

Gyula Fóra
It seems that not everyone can see the screenshot in the email, so here is
a link:

https://drive.google.com/open?id=1abrlpI976NFqOZSX20k2FoiAfVhBbER9

Re: [DISCUSS] Rework History Server into Global Dashboard

Till Rohrmann
Hi Gyula,

thanks for proposing this extension. I can see that such a feature could be
helpful.

However, I wouldn't consider the management of multiple clusters core to
Flink. Managing a single cluster is already complex enough and given the
available community capacity I would rather concentrate on doing this
aspect right instead of adding more complexity and more code to maintain.

Maybe we could add this feature as a Flink package instead. That way it
would still be available to our users. If it gains enough traction then we
can also add it to Flink later. What do you think?

Cheers,
Till

Re: [DISCUSS] Rework History Server into Global Dashboard

Gyula Fóra
Hi Till!

I agree to some extent that managing multiple clusters is not Flink's
primary responsibility.

However, many (if not most) production users run Flink in per-job-cluster
mode, which gives better configurability and resource isolation than the
standalone/session modes.
Still, the best job management experience is on standalone clusters, where
users see all the jobs and can interact with them purely through their unique
job ids.

This is the mismatch we were trying to resolve here, to get the best of
both worlds. This of course only concerns production users running many
different jobs, so we can definitely call it an enterprise feature.

I agree that this would be new code to maintain, in contrast to the current
History Server, which "just works".

We are completely okay with not adding this to Flink just yet, as it will
be part of the next Cloudera Flink release anyway. We will test-run it
there, gather production feedback for the Flink community, and we can make a
better decision afterwards when we see the real value.

Cheers,
Gyula



Re: [DISCUSS] Rework History Server into Global Dashboard

Till Rohrmann
This sounds like a good plan to me, Gyula. And there is always the Flink
Packages option if we want to make it available earlier.

Cheers,
Till
