[DISCUSS] Flink backward compatibility


[DISCUSS] Flink backward compatibility

Thomas Weise
Hi,

I wanted to bring back the topic of backward compatibility with respect to
all/most of the user-facing aspects of Flink. Please note that this isn't
limited to the programming API, but also includes job submission and
management.

As can be seen in [1], changes in these areas cause difficulties
downstream. Projects have to choose between Flink versions, and users are
ultimately at a disadvantage, either by not being able to use the desired
dependency or by facing forced upgrades to their infrastructure.

IMO the preferred solution would be that downstream projects can build
against a minimum version of Flink and expect compatibility with future
releases of the major version stream. For example, my project depends on
1.6.x and can expect to run without recompilation on 1.7.x and later.
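That policy can be stated precisely. Here is a minimal sketch (plain Python, not Flink code, and the version rule is my reading of the proposal): an artifact built against some minimum Flink version runs unchanged on any equal-or-later release within the same major version stream.

```python
# Sketch of the proposed compatibility policy: same major version,
# cluster minor/patch at least as new as what the job was built against.

def parse_version(v):
    """Parse a 'major.minor.patch' string into an (int, int, int) tuple."""
    major, minor, patch = (int(p) for p in v.split("."))
    return major, minor, patch

def runs_without_recompilation(built_against, cluster_version):
    """True if the cluster is the same major version and at least as new."""
    b = parse_version(built_against)
    c = parse_version(cluster_version)
    return c[0] == b[0] and c[1:] >= b[1:]

print(runs_without_recompilation("1.6.0", "1.7.2"))  # True under this policy
print(runs_without_recompilation("1.6.0", "2.0.0"))  # False: major version bump
```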

How far away is Flink from stabilizing the surface that affects typical
users?

Thanks,
Thomas

[1] https://issues.apache.org/jira/browse/BEAM-5419

Re: [DISCUSS] Flink backward compatibility

Chesnay Schepler-3
I think this discussion needs specific examples as to what should be
possible, as it is otherwise too vague / open to interpretation.

For example, "job submission" may refer to CLI invocations continuing to
work (i.e. CLI arguments), or being able to use a 1.6 client against a
1.7 cluster, which are entirely different things.

What does "management" include? Dependencies? The set of jars that are
released on Maven? The set of jars bundled with flink-dist?

On 26.11.2018 17:24, Thomas Weise wrote:

> [...]


Re: [DISCUSS] Flink backward compatibility

Fabian Hueske-2
Hi,

I think this is a very good discussion to have.
Flink is becoming part of more and more production deployments, and more
tools are being built around it.
The question is whether we want to (or can) make parts of the
control/maintenance/monitoring APIs stable, such that external
systems/frameworks can rely on them.

Which APIs are relevant?
Which APIs could be declared as stable?
Which parts are still evolving?

Fabian

Am Di., 27. Nov. 2018 um 15:10 Uhr schrieb Chesnay Schepler <
[hidden email]>:

> [...]

Re: [DISCUSS] Flink backward compatibility

Thomas Weise
Some scenarios that come to mind:

Flink client binary compatibility with a remote cluster: This would include
RemoteStreamEnvironment, RestClusterClient, etc. Users should be able to
submit a job built with 1.6.x, using the 1.6.x binaries, to a remote
Flink 1.7.x or later cluster. The use case for this is Beam.

REST API compatibility: User tooling built against the 1.6.x REST API spec
continues to work with the 1.7.x or later REST API.

CLI compatibility: The commands/options exposed in the CLI continue to be
available after an upgrade. Users can just point to the new CLI location.

Metrics: Metrics that exist in 1.6.x are available in 1.7.x.

There is probably a lot more (such as the various backends that users can
configure and their options), and there are different levels of
cost/complexity trade-offs. I brought up the REST API in the past after
observing tooling breakage when going from 1.4.x to 1.5.x.

The client binary compatibility issue will grow more severe as the
ecosystem expands. Beam is a representative example in that category. To
solve the issue downstream, different communities and users would each need
to come up with build system/release support for multiple parallel Flink
versions. It would be better to shield them from such complexity.
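To make that downstream burden concrete, here is a hypothetical sketch (module names are made up for illustration, not Beam's actual artifacts) of what a project ends up maintaining today: one runner artifact per supported Flink minor version, selected at build time.

```python
# Without a stable client surface, downstream projects must build and
# release a version-specific module per Flink minor version they support.
SUPPORTED_FLINK_RUNNERS = {
    (1, 5): "runner-flink-1.5",
    (1, 6): "runner-flink-1.6",
    (1, 7): "runner-flink-1.7",
}

def runner_for(flink_version):
    """Pick the version-specific runner module for a given Flink cluster."""
    major, minor = (int(p) for p in flink_version.split(".")[:2])
    try:
        return SUPPORTED_FLINK_RUNNERS[(major, minor)]
    except KeyError:
        raise ValueError(f"no runner built for Flink {flink_version}")

print(runner_for("1.6.4"))  # runner-flink-1.6
```

With the compatibility guarantee proposed above, this whole table collapses to a single artifact built against the minimum supported version.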

Thanks,
Thomas


On Tue, Nov 27, 2018 at 6:27 AM Fabian Hueske <[hidden email]> wrote:

> [...]

Re: [DISCUSS] Flink backward compatibility

Chesnay Schepler-3
So let's take a look...

Binary client compatibility: The key issue I see hasn't changed since
the last time this was brought up: clients rely on the JobGraph to
submit the job, which is an internal data structure. AFAIK there will
also be changes made to said class soon(ish). So long as we don't
introduce a decoupled structure and/or compatibility routines here, this
is not feasible.
The client in general may be in the way here. The unfortunate reality is
that the client code is one big mess that is due for a complete rewrite.
I doubt anyone has an all-encompassing view of the hidden assumptions
baked into it that we would have to retain if we go for backwards
compatibility.

CLI compatibility: Does this include all start scripts or just the flink
executable? I think this makes sense, but so far we have done a reasonable
job of not changing command-line parameters. (But maybe only because
changing this part of the CLI is a massive pain...)

REST API: The versioning introduced in 1.7.0 is a significant step
towards a stable API as it allows us to modify things without
(inherently) breaking it.
We're primarily missing tests here to verify the stability, but these
are being worked on.
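A toy sketch of why that versioning helps (this is not Flink's actual dispatcher; the handlers and response shapes are invented for illustration): once the version is part of the path, an old versioned endpoint can keep its response shape frozen while newer versions evolve freely.

```python
# Versioned REST dispatch: /v1 responses stay stable even after /v2
# changes the payload shape, so 1.6-era tooling keeps working.
HANDLERS = {
    ("v1", "jobs"): lambda: {"jobs": ["job-1"]},  # shape frozen forever
    ("v2", "jobs"): lambda: {"jobs": [{"id": "job-1", "state": "RUNNING"}]},
}

def handle(path):
    """Dispatch a '/<version>/<resource>' path to its versioned handler."""
    _, version, resource = path.split("/")
    return HANDLERS[(version, resource)]()

# A client built against the v1 spec is unaffected by the v2 addition.
print(handle("/v1/jobs"))  # {'jobs': ['job-1']}
```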

Metrics: I would not categorize them as stable in general, the reason
being that we are still refactoring and streamlining their usage. For
some core system metrics (checkpoint info, IO) we can _probably_
guarantee stability.

On 27.11.2018 18:43, Thomas Weise wrote:

> [...]


Re: [DISCUSS] Flink backward compatibility

Stephan Ewen
Few thoughts from my side:

(1) The client needs a big refactoring / cleanup. It should use a proper
HTTP client library to help with future authentication mechanisms.
Once that is done, we should identify a "client API" that we make stable,
just like the DataStream / DataSet APIs.

(2) We will most likely refactor the stack in the near future (see
discussion threads on batch / streaming unification).
I would suggest that we define a DAG API as the common substrate and as the
data structure in which jobs are submitted to the REST API (session modes)
and stored in HA services (job mode). Think of it as a JobGraph++.  It may
be a good idea to define that structure via ProtoBuf (or a similar tool) to
support forward/backwards compatibility.
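The property a ProtoBuf-style structure buys here can be sketched in plain Python (no protobuf library involved; field names are hypothetical): a reader built against an older schema can still consume a graph written by a newer version, ignoring but preserving the fields it does not know about.

```python
# Protobuf-style unknown-field tolerance, sketched by hand: an old reader
# separates the fields it understands from those a newer writer added,
# keeping the unknowns intact so they survive a round trip.
KNOWN_FIELDS = {"id", "operator", "inputs"}

def read_node(serialized):
    """Split a serialized DAG node into known fields and preserved unknowns."""
    known = {k: v for k, v in serialized.items() if k in KNOWN_FIELDS}
    unknown = {k: v for k, v in serialized.items() if k not in KNOWN_FIELDS}
    return known, unknown

# A newer writer added a 'resource_profile' field; the old reader still
# works and can carry the unknown field along untouched.
node = {"id": "map-1", "operator": "Map", "inputs": ["src-1"],
        "resource_profile": {"cpu": 2}}
known, unknown = read_node(node)
print(known)    # {'id': 'map-1', 'operator': 'Map', 'inputs': ['src-1']}
print(unknown)  # {'resource_profile': {'cpu': 2}}
```

Java serialization of JobGraph gives none of this; any field or class change breaks old clients, which is exactly the problem discussed above.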

Best,
Stephan


On Wed, Nov 28, 2018 at 10:45 AM Chesnay Schepler <[hidden email]>
wrote:

> [...]