[DISCUSS] What do we gain by supporting customized High-Availability services

tison
Hi devs,

The community recently decided not to support customized restart
strategies [1], which got me thinking about what kind of customization
support a framework like Flink should provide.

The key idea is that pluggable is not the same as customizable.

We may maintain a number of implementations of restart strategies, as well
as of high-availability services, in our codebase. But that set has a fixed
size, which is quite different from supporting arbitrary customized
implementations.

A service such as the high-availability services relies on quite a lot of
runtime implementation details underneath. For example, JobGraphStore
supports #releaseJobGraph, which was originally introduced because of the
ZK lock strategy; getJobManagerRetriever requires a default address because
StandaloneHighAvailabilityServices is non-HA and pre-configured.

Interfaces of this kind, however, are likely to evolve together with the
Flink runtime implementation, such as cluster management and coordination
details. If we support customizing them, these internal high-availability
interfaces effectively become public interfaces. If we keep them pluggable,
we can extend them in response to runtime evolution and ensure that the
implementations stay in a fixed set, while still introducing new
implementations (such as etcd [2] or MapDB [3]) if they are a good fit.

We don't support customizing the ResourceManager, although it is pluggable
in the sense that others can implement a Kubernetes resource manager [4].
Maybe this is a better way to handle high-availability services:
pluggable, but not customizable.

Looking forward to your ideas. To be clear, I'm not trying to drop it now,
but I'm a bit confused about this topic and would like to turn to the
wisdom of our community.

Best,
tison.

[1] https://lists.apache.org/x/thread.html/6ed95eb6a91168dba09901e158bc1b6f4b08f1e176db4641f79de765@%3Cdev.flink.apache.org%3E
[2] https://issues.apache.org/jira/browse/FLINK-11105
[3] https://lists.apache.org/x/thread.html/eae4cbdf6dac466bc0247e3bc1a7a69fe7e1db7a512fcd607e9c081b@%3Cuser.flink.apache.org%3E
[4] https://github.com/tianchen92/flink/tree/k8s-master/flink-kubernete

Re: [DISCUSS] What do we gain by supporting customized High-Availability services

tison
One challenge is how we would ensure support for customized
implementations. When we introduced JobGraphStore#releaseJobGraph, we
actually changed quite a bit of the code path in the Dispatcher. Since we
are unable to test arbitrary customized implementations, our compatibility
promise amounts to no more than compile-time compatibility.
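
To make the point concrete, here is a minimal Java sketch; it is not Flink
code, and every name in it (`GraphStore`, `CustomStore`, `SimpleDispatcher`)
is hypothetical. It assumes the new method is added with a default body, so
a third-party implementation keeps compiling while the caller starts to rely
on behavior that implementation never provided.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for an internal store interface (not the real JobGraphStore).
interface GraphStore {
    void put(String jobId);

    // Added in a later "version". The empty default body keeps old
    // implementations compiling, but it releases nothing.
    default void release(String jobId) {}
}

// A third-party implementation written against the old interface.
class CustomStore implements GraphStore {
    final Set<String> locked = new HashSet<>();

    @Override
    public void put(String jobId) {
        locked.add(jobId); // takes a "lock" when storing
    }
}

// Hypothetical caller whose code path changed to rely on release().
class SimpleDispatcher {
    static void finishJob(GraphStore store, String jobId) {
        store.put(jobId);
        store.release(jobId); // the caller assumes the lock is gone afterwards
    }
}

class CompileCompatDemo {
    public static void main(String[] args) {
        CustomStore store = new CustomStore();
        SimpleDispatcher.finishJob(store, "job-1");
        // Everything compiles and runs, yet the custom store still holds its lock.
        System.out.println("still locked: " + store.locked.contains("job-1"));
    }
}
```

Nothing fails at compile time here; the mismatch only shows up in behavior,
which is the kind of break a compile-only compatibility promise cannot catch.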

Users would still need to be familiar with implementation details to figure
out whether their implementation still fits when they bump the Flink
version. The same effort, and no more, is required when we only support
pluggable strategies. In other words, claiming customization support tends
to hide the challenge users face when they want to use their own
implementation.

Re: [DISCUSS] What do we gain by supporting customized High-Availability services

tison
Another perspective is that a stable, carefully designed interface with
clear semantics could be safer to customize.

Following the discussion in FLINK-10333 [1], our JobGraphStore is actually
required to perform write operations only while holding leadership, which
is a basic requirement for coordination rather than an implementation
detail. Thus it depends on LeaderElectionService (in the design, we narrow
the specific interface to LeaderStore).
HighAvailabilityServices#getJobGraphStore() implies an implicit field for
that dependency, which makes the relationship between them hard to express.
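
The leadership requirement can be sketched as follows. This is only an
illustration under stated assumptions: `Leadership` and `LeaderGuardedStore`
are hypothetical names, not the LeaderStore design from FLINK-10333, and the
in-memory flag stands in for whatever a leader election service would report.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical leadership handle; stands in for the state a leader
// election service would report to the store.
class Leadership {
    private volatile boolean leader;

    void grant() { leader = true; }
    void revoke() { leader = false; }
    boolean isLeader() { return leader; }
}

// Hypothetical store that takes its leadership dependency explicitly
// instead of relying on an implicit field.
class LeaderGuardedStore {
    private final Leadership leadership;
    private final Map<String, String> graphs = new HashMap<>();

    LeaderGuardedStore(Leadership leadership) {
        this.leadership = leadership;
    }

    // Writes are rejected unless this process currently holds leadership.
    void put(String jobId, String graph) {
        if (!leadership.isLeader()) {
            throw new IllegalStateException("not the leader; write rejected");
        }
        graphs.put(jobId, graph);
    }

    // Reads need no leadership.
    String get(String jobId) {
        return graphs.get(jobId);
    }
}

class LeaderGuardDemo {
    public static void main(String[] args) {
        Leadership leadership = new Leadership();
        LeaderGuardedStore store = new LeaderGuardedStore(leadership);

        leadership.grant();
        store.put("job-1", "graph-v1");

        leadership.revoke();
        try {
            store.put("job-1", "graph-v2");
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        System.out.println("stored: " + store.get("job-1"));
    }
}
```

Passing the leadership handle through the constructor makes the dependency
visible in the type. Note that this sketch checks and then writes in two
steps; a real store would need the check and the write to be atomic against
the backing system, which is not attempted here.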

If the interface is unstable (we are also introducing a ClientHAService to
separate concerns, and would have to keep backward compatibility for
customized implementations), we had better keep it internal so that it can
evolve freely. And when we do try to support customization, it would be
helpful to start a proposal to revisit the interface so that it is well
designed and stable; see [2] for reference.

In short, the current high-availability services, as well as
runtime/coordination, are still under development and active evolution. It
is probably not a good time to make them public and customizable.

Best,
tison.

[1] https://issues.apache.org/jira/browse/FLINK-10333
[2] https://github.com/apache/pulsar/wiki/PIP-45%3A-Pluggable-metadata-interface


On Thu, Oct 17, 2019 at 8:13 PM Zili Chen <[hidden email]> wrote:

> One challenge is how we would ensure support for customized
> implementations. When we introduced JobGraphStore#releaseJobGraph, we
> actually changed quite a bit of the code path in the Dispatcher. Since we
> are unable to test arbitrary customized implementations, our
> compatibility promise amounts to no more than compile-time compatibility.
>
> Users would still need to be familiar with implementation details to
> figure out whether their implementation still fits when they bump the
> Flink version. The same effort, and no more, is required when we only
> support pluggable strategies. In other words, claiming customization
> support tends to hide the challenge users face when they want to use
> their own implementation.

Re: [DISCUSS] What do we gain by supporting customized High-Availability services

Till Rohrmann
Hi Tison,

I'm not sure whether I fully understand your distinction between
customizable and pluggable. Maybe you could clarify your ideas a bit
because you seem to favour support for pluggable implementations.

Maybe let me try to answer some other questions you raised. With the
HighAvailabilityServices interface and the functionality to load custom
implementations specified via
`high-availability: <fully qualified class name>`, it is indeed possible to
provide a custom implementation of the HA services. This is, however,
pretty much a power-user feature where users have to implement against an
internal API. As such, it can be subject to change, and we don't give
guarantees that these interfaces won't change. Of course, if possible, we
should extend/change it in a way that preserves backwards compatibility.
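
As a rough illustration of loading an implementation from a configured class
name, here is a self-contained Java sketch. It does not use Flink's actual
classes: `HaServices`, `HaServicesFactory`, `InMemoryHaFactory`, and
`HaLoader` are hypothetical, and the assumption that the configured class is
a factory with a no-argument constructor is made for the sake of the example.

```java
// Hypothetical service and factory contracts (not Flink's interfaces).
interface HaServices {
    String name();
}

interface HaServicesFactory {
    HaServices create();
}

// A user-provided factory that would be named in the configuration.
class InMemoryHaFactory implements HaServicesFactory {
    @Override
    public HaServices create() {
        return () -> "in-memory";
    }
}

class HaLoader {
    // Resolves the configured class name reflectively and asks the factory
    // for the services. Any loading problem is reported as a bad config value.
    static HaServices load(String factoryClassName) {
        try {
            Class<? extends HaServicesFactory> factoryClass =
                    Class.forName(factoryClassName).asSubclass(HaServicesFactory.class);
            return factoryClass.getDeclaredConstructor().newInstance().create();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException(
                    "Cannot load HA factory " + factoryClassName, e);
        }
    }
}

class HaLoaderDemo {
    public static void main(String[] args) {
        // In a real setup this string would come from the configuration file.
        String configured = InMemoryHaFactory.class.getName();
        System.out.println(HaLoader.load(configured).name());
    }
}
```

The loader only verifies that the class exists and has the right type. It
cannot check that the implementation behaves the way the runtime expects,
which is why implementing against an internal API remains the user's risk.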

You are right that at the moment this interface is not stable enough to be
a public API. Once this changes and we are happy with it, then we can
think about making it a public API and documenting it properly. AFAIK,
there is no documentation on how to implement your own HA services at the
moment. This underlines as well that this interface is an internal API.

Cheers,
Till

On Fri, Oct 18, 2019 at 5:56 AM Zili Chen <[hidden email]> wrote:

> Another perspective is that a stable, carefully designed interface with
> clear semantics could be safer to customize.
>
> Following the discussion in FLINK-10333 [1], our JobGraphStore is
> actually required to perform write operations only while holding
> leadership, which is a basic requirement for coordination rather than an
> implementation detail. Thus it depends on LeaderElectionService (in the
> design, we narrow the specific interface to LeaderStore).
> HighAvailabilityServices#getJobGraphStore() implies an implicit field for
> that dependency, which makes the relationship between them hard to
> express.
>
> If the interface is unstable (we are also introducing a ClientHAService
> to separate concerns, and would have to keep backward compatibility for
> customized implementations), we had better keep it internal so that it
> can evolve freely. And when we do try to support customization, it would
> be helpful to start a proposal to revisit the interface so that it is
> well designed and stable; see [2] for reference.
>
> In short, the current high-availability services, as well as
> runtime/coordination, are still under development and active evolution.
> It is probably not a good time to make them public and customizable.
>
> Best,
> tison.
>
> [1] https://issues.apache.org/jira/browse/FLINK-10333
> [2] https://github.com/apache/pulsar/wiki/PIP-45%3A-Pluggable-metadata-interface

Re: [DISCUSS] What do we gain by supporting customized High-Availability services

tison
Thanks for your clarification, Till.

Regarding the terms pluggable and customizable, I am mainly concerned about
the audience of the interface. Pluggable means we have multiple
high-availability implementations, but the set is closed to extension by
users; customizable means the high-availability interfaces are stable and
user-facing.
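
The "pluggable" side of this distinction can be sketched in Java as a closed
set that users select from by name. The names below (`Ha`, `HaMode`,
`HaSelector`) and the two modes are hypothetical and chosen only for
illustration; this is not how Flink is structured.

```java
import java.util.Locale;
import java.util.function.Supplier;

// Hypothetical internal interface; free to change because no user implements it.
interface Ha {
    String describe();
}

// Pluggable: the framework ships a fixed set of implementations.
enum HaMode {
    NONE(() -> () -> "standalone, no HA"),
    ZOOKEEPER(() -> () -> "ZooKeeper-backed");

    private final Supplier<Ha> factory;

    HaMode(Supplier<Ha> factory) {
        this.factory = factory;
    }

    Ha create() {
        return factory.get();
    }
}

class HaSelector {
    // Users choose among the shipped modes; any other value is rejected
    // by Enum.valueOf with an IllegalArgumentException.
    static Ha select(String configValue) {
        return HaMode.valueOf(configValue.toUpperCase(Locale.ROOT)).create();
    }
}

class HaSelectorDemo {
    public static void main(String[] args) {
        System.out.println(HaSelector.select("zookeeper").describe());
        try {
            HaSelector.select("com.example.MyHa");
        } catch (IllegalArgumentException e) {
            System.out.println("not in the fixed set: com.example.MyHa");
        }
    }
}
```

A customizable design would instead accept an arbitrary user class, which
only works well if the interface it implements is stable and documented.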

I agree that we should try to keep backward compatibility if possible. I
started this thread because I was unsure how to make progress on
FLINK-10333, specifically how to deal with compatibility.

Given that it is an internal interface, I think it is reasonable for us to
evolve it to support leader-store-based high-availability storage in a
minor version bump. I'm going to start a discussion thread next week to
gather wider feedback beyond the original (maybe limited) JIRA, which also
calls for review among community members. What do you think?

Best,
tison.


On Mon, Oct 21, 2019 at 5:21 PM Till Rohrmann <[hidden email]> wrote:

> Hi Tison,
>
> I'm not sure whether I fully understand your distinction between
> customizable and pluggable. Maybe you could clarify your ideas a bit
> because you seem to favour support for pluggable implementations.
>
> Maybe let me try to answer some other questions you raised. With the
> HighAvailabilityServices interface and the functionality to load custom
> implementations specified via
> `high-availability: <fully qualified class name>`, it is indeed possible
> to provide a custom implementation of the HA services. This is, however,
> pretty much a power-user feature where users have to implement against an
> internal API. As such, it can be subject to change, and we don't give
> guarantees that these interfaces won't change. Of course, if possible, we
> should extend/change it in a way that preserves backwards compatibility.
>
> You are right that at the moment this interface is not stable enough to
> be a public API. Once this changes and we are happy with it, then we can
> think about making it a public API and documenting it properly. AFAIK,
> there is no documentation on how to implement your own HA services at the
> moment. This underlines as well that this interface is an internal API.
>
> Cheers,
> Till

Re: [DISCUSS] What do we gain by supporting customized High-Availability services

Till Rohrmann
I would be in favour of changing interfaces only between major versions.
Otherwise we risk that existing setups break when upgrading to the latest
minor version.

+1 for the outreach.

Cheers,
Till

On Wed, Oct 23, 2019 at 1:11 PM Zili Chen <[hidden email]> wrote:

> Thanks for your clarification, Till.
>
> Regarding the terms pluggable and customizable, I am mainly concerned
> about the audience of the interface. Pluggable means we have multiple
> high-availability implementations, but the set is closed to extension by
> users; customizable means the high-availability interfaces are stable and
> user-facing.
>
> I agree that we should try to keep backward compatibility if possible. I
> started this thread because I was unsure how to make progress on
> FLINK-10333, specifically how to deal with compatibility.
>
> Given that it is an internal interface, I think it is reasonable for us
> to evolve it to support leader-store-based high-availability storage in a
> minor version bump. I'm going to start a discussion thread next week to
> gather wider feedback beyond the original (maybe limited) JIRA, which
> also calls for review among community members. What do you think?
>
> Best,
> tison.