(DEPRECATED) Apache Flink Mailing List archive.

[DISCUSS] Consolidate method naming between the batch and streaming API

Márton Balassi

[DISCUSS] Consolidate method naming between the batch and streaming API

Looking at the DataSet and DataStream APIs, Aljoscha and I have come to the
conclusion that a few methods provide the same functionality but are named
differently. These are the following:

1. rebalance (batch) / distribute (streaming): Rebalances the data sent
to the downstream operators, distributing it evenly among them.
2. partitionByHash, partitionCustom (batch) / partitionBy (streaming):
Partitioning has only recently been exposed in the streaming API and is not
as refined as in the batch API. The streaming partitionBy is actually
partitionByHash.
3. union (batch) / merge, connect (streaming): The streaming merge does
a union of two streams of the same type. Connect is conceptually
different: it provides a way of sharing state between two streams of
potentially different types without mapping them to a common type and then
merging them. This saves latency and an ugly mapping. The latency advantage
could also be achieved with proper operator chaining, but the ugly mapping
would still be needed if we did not have connect.
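
To make the distinctions in the list above concrete, here is a minimal
sketch in plain Java. It is not Flink code: the class PartitioningSketch and
all of its methods are hypothetical stand-ins that only mimic the semantics
described above on in-memory lists, assuming round-robin for
rebalance/distribute, key hash modulo parallelism for partitionByHash,
same-type concatenation for union/merge, and one mapping function per input
for connect.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Plain-Java illustration only; none of these methods are Flink API.
final class PartitioningSketch {

    // rebalance / distribute: spread records round-robin over the channels.
    static <T> List<List<T>> rebalance(List<T> records, int parallelism) {
        List<List<T>> channels = emptyChannels(parallelism);
        for (int i = 0; i < records.size(); i++) {
            channels.get(i % parallelism).add(records.get(i));
        }
        return channels;
    }

    // partitionByHash: records with equal keys always land on the same channel.
    static <T, K> List<List<T>> partitionByHash(
            List<T> records, Function<T, K> keyOf, int parallelism) {
        List<List<T>> channels = emptyChannels(parallelism);
        for (T record : records) {
            int index = Math.floorMod(keyOf.apply(record).hashCode(), parallelism);
            channels.get(index).add(record);
        }
        return channels;
    }

    // union / merge: both inputs must already have the same element type.
    static <T> List<T> union(List<T> first, List<T> second) {
        List<T> out = new ArrayList<>(first);
        out.addAll(second);
        return out;
    }

    // connect: the inputs keep their own types and each side gets its own
    // function, so no common input type is needed. A real stream would
    // interleave both inputs; this sketch simply processes them in turn.
    static <A, B, R> List<R> connectAndMap(
            List<A> first, List<B> second, Function<A, R> map1, Function<B, R> map2) {
        List<R> out = new ArrayList<>();
        for (A a : first) {
            out.add(map1.apply(a));
        }
        for (B b : second) {
            out.add(map2.apply(b));
        }
        return out;
    }

    private static <T> List<List<T>> emptyChannels(int parallelism) {
        List<List<T>> channels = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            channels.add(new ArrayList<>());
        }
        return channels;
    }

    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5);
        System.out.println(rebalance(numbers, 2));
        System.out.println(partitionByHash(Arrays.asList("a", "b", "a"), s -> s, 3));
        System.out.println(union(numbers, Arrays.asList(6, 7)));
        System.out.println(
                connectAndMap(numbers, Arrays.asList("x"), i -> "int:" + i, s -> "str:" + s));
    }
}
```

Note that union needs a single element type for both inputs, while
connectAndMap accepts two different ones; that is the "ugly mapping" that
connect avoids.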

To consolidate the naming I would suggest the following:

1. Rename streaming distribute to rebalance.
2. Rename streaming partitionBy to partitionByHash and file a JIRA issue for
custom partitioning support in streaming.
3. Rename streaming merge to union; leave streaming connect as it is.
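
The proposal above only asks for renames and does not say whether the old
names would stay around. Assuming a transition period were wanted, one
common way to stage such a rename is to keep the old name as a deprecated
alias that forwards to the new one. The sketch below is plain Java, not the
Flink DataStream class: RenamedStreamSketch and its call log are
hypothetical and exist only to show the forwarding pattern.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a stream handle; not the Flink DataStream class.
final class RenamedStreamSketch {

    // Records which operations were actually applied, for illustration.
    private final List<String> calls = new ArrayList<>();

    // New name, matching the batch API.
    RenamedStreamSketch rebalance() {
        calls.add("rebalance");
        return this;
    }

    // Old streaming name kept as a forwarding alias (an assumption; the
    // thread does not say the old names would be kept).
    @Deprecated
    RenamedStreamSketch distribute() {
        return rebalance();
    }

    // New name, matching the batch API.
    RenamedStreamSketch union(RenamedStreamSketch other) {
        calls.add("union");
        return this;
    }

    // Old streaming name, forwarding to union.
    @Deprecated
    RenamedStreamSketch merge(RenamedStreamSketch other) {
        return union(other);
    }

    List<String> calls() {
        return calls;
    }
}
```

Callers of the old names would then get a deprecation warning at compile
time while ending up on exactly the same code path as the new names.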

Aljoscha Krettek-2

Re: [DISCUSS] Consolidate method naming between the batch and streaming API

Yes, these renamings make sense. The partitionBy() is not yet in the
master for streaming, though.

On Mon, Jun 1, 2015 at 4:10 PM, Márton Balassi <[hidden email]> wrote:

> Looking at the DataSet and DataStream APIs we have come to the conclusion
> with Aljoscha that there are a few methods that although providing the same
> functionality are named differently. These are the following:
>
> 1. rebalance (batch) / distribute (streaming): Rebalances the data sent
> to the downstream operators thus equally distributing it.
> 2. partitionByHash, partitionCustom (batch) / partitionBy (streaming):
> Partitioning has just recently been exposed in the streaming API and is not
> as refined as the batch one. The streaming partitionBy is actually
> partitionByHash.
> 3. Union (batch) / merge, connect (streaming): The streaming merge does
> a union of two streams with the same type. Connect is conceptually
> different, it provides a way of sharing state between two streams with
> potentially different types without mapping them to a common type and then
> merging them. This saves latency and an ugly mapping. The former advantage
> can be offset by proper operator chaining, the second one would remain if
> we did not have connect.
>
> To consolidate the naming I would suggest the following:
>
> 1. Rename streaming distribute to rebalance.
> 2. Rename streaming partitionBy to partitionByHash and file JIRA for
> custom partitioning support for streaming.
> 3. Rename streaming merge to union, leave streaming connect as it is.

Gyula Fóra-2

Re: [DISCUSS] Consolidate method naming between the batch and streaming API

+1 for the changes proposed by Marton (before the release)

On Mon, Jun 1, 2015 at 16:32, Aljoscha Krettek <[hidden email]> wrote:

> Yes, these renamings make sense. The partitionBy() is not yet in the
> master for streaming, though.

Fabian Hueske-2

Re: [DISCUSS] Consolidate method naming between the batch and streaming API

Thanks for bringing up this point!

+1 for the renaming.
@Marton: Is this a "complete" list, i.e., did you go through both APIs or
might there be more methods that are semantically identical but named
differently?

2015-06-01 17:31 GMT+02:00 Gyula Fóra <[hidden email]>:

> +1 for the changes proposed by Marton (before the release)

Stephan Ewen

Re: [DISCUSS] Consolidate method naming between the batch and streaming API

+1

Good list and choices, Marton!

On Mon, Jun 1, 2015 at 5:45 PM, Fabian Hueske <[hidden email]> wrote:

> Thanks for bringing up this point!
>
> +1 for the renaming.
> @Marton: Is this a "complete" list, i.e., did you go through both APIs or
> might there be more methods that are semantically identical but named
> differently?

Márton Balassi

Re: [DISCUSS] Consolidate method naming between the batch and streaming API

@Fabian: I hope that this is the complete list, correct me if I am wrong. :)

I will then open a small PR with these changes on top of Aljoscha's PR that
exposes the streaming partitioning.

On Mon, Jun 1, 2015 at 6:01 PM, Stephan Ewen <[hidden email]> wrote:

> +1
>
> Good list and choices, Marton!

Szabó Péter

Re: [DISCUSS] Consolidate method naming between the batch and streaming API

Great proposal! We should use consistent naming for the two APIs.

Peter

2015-06-01 21:11 GMT+02:00 Márton Balassi <[hidden email]>:

> @Fabian: I hope that this is the complete list, correct me if I am wrong. :)
>
> I am opening a small PR with the changes on top of Aljoscha's one that
> exposes the streaming partitioning then.