|
Looking at the DataSet and DataStream APIs, Aljoscha and I have come to the conclusion that a few methods provide the same functionality but are named differently. These are the following:

1. rebalance (batch) / distribute (streaming): Rebalances the data sent to the downstream operators, thus distributing it equally.
2. partitionByHash, partitionCustom (batch) / partitionBy (streaming): Partitioning has only recently been exposed in the streaming API and is not as refined as the batch one. The streaming partitionBy is actually partitionByHash.
3. union (batch) / merge, connect (streaming): The streaming merge does a union of two streams with the same type. Connect is conceptually different: it provides a way of sharing state between two streams with potentially different types without mapping them to a common type and then merging them. This saves latency and an ugly mapping. The latency overhead can be offset by proper operator chaining, but the ugly mapping would remain if we did not have connect.

To consolidate the naming I would suggest the following:

1. Rename streaming distribute to rebalance.
2. Rename streaming partitionBy to partitionByHash and file a JIRA for custom partitioning support for streaming.
3. Rename streaming merge to union; leave streaming connect as it is.
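For readers less familiar with the operators being compared, here is a minimal sketch of their semantics in plain Python. It is not Flink code: the function names `rebalance`, `partition_by_hash`, `union` and `connect`, the list-of-lists model of downstream channels, and the `(side, record)` event encoding are all assumptions made for illustration only.

```python
def rebalance(records, parallelism):
    """Round-robin: spread records evenly over the downstream channels."""
    channels = [[] for _ in range(parallelism)]
    for i, record in enumerate(records):
        channels[i % parallelism].append(record)
    return channels


def partition_by_hash(records, key, parallelism):
    """Hash partitioning: records with the same key share a channel."""
    channels = [[] for _ in range(parallelism)]
    for record in records:
        channels[hash(key(record)) % parallelism].append(record)
    return channels


def union(first, second):
    """Union (streaming 'merge'): combine two inputs of the same type."""
    return list(first) + list(second)


def connect(events, on_first, on_second):
    """Connect: two inputs of different types, one shared state.

    `events` is a list of (side, record) pairs, where side is 1 or 2.
    Each side has its own handler, so neither input has to be mapped
    to a common type before processing.
    """
    state = {}
    output = []
    for side, record in events:
        handler = on_first if side == 1 else on_second
        output.extend(handler(record, state))
    return output


def set_min_length(threshold, state):
    """Handler for the first input (ints): update the shared threshold."""
    state["min_len"] = threshold
    return []


def filter_by_length(word, state):
    """Handler for the second input (strings): filter using the threshold."""
    return [word] if len(word) > state.get("min_len", 0) else []
```

In this sketch, `connect([(1, 3), (2, "flink"), (2, "api")], set_min_length, filter_by_length)` lets an integer control input steer a string data input through the shared `state` dictionary, which is the part a plain `union` cannot express without first wrapping both inputs in a common type.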
|
Yes, these renamings make sense. The partitionBy() is not yet in the master for streaming, though.

On Mon, Jun 1, 2015 at 4:10 PM, Márton Balassi <[hidden email]> wrote:
> To consolidate the naming I would suggest the following:
>
> 1. Rename streaming distribute to rebalance.
> 2. Rename streaming partitionBy to partitionByHash and file JIRA for custom partitioning support for streaming.
> 3. Rename streaming merge to union, leave streaming connect as it is.
|
+1 for the changes proposed by Marton (before the release)

Aljoscha Krettek <[hidden email]> wrote (on Mon, Jun 1, 2015, 16:32):
> Yes, these renamings make sense. The partitionBy() is not yet in the master for streaming, though.
|
Thanks for bringing up this point!

+1 for the renaming.
@Marton: Is this a "complete" list, i.e., did you go through both APIs, or might there be more methods that are semantically identical but named differently?

2015-06-01 17:31 GMT+02:00 Gyula Fóra <[hidden email]>:
> +1 for the changes proposed by Marton (before the release)
|
+1

Good list and choices, Marton!

On Mon, Jun 1, 2015 at 5:45 PM, Fabian Hueske <[hidden email]> wrote:
> Thanks for bringing up this point!
>
> +1 for the renaming.
> @Marton: Is this a "complete" list, i.e., did you go through both APIs or might there be more methods that are semantically identical but named differently?
|
@Fabian: I hope that this is the complete list, correct me if I am wrong. :)

I will open a small PR with the changes on top of Aljoscha's PR that exposes the streaming partitioning.

On Mon, Jun 1, 2015 at 6:01 PM, Stephan Ewen <[hidden email]> wrote:
> +1
>
> Good list and choices, Marton!
|
Great proposal! We should use consistent naming for the two APIs.

Peter

2015-06-01 21:11 GMT+02:00 Márton Balassi <[hidden email]>:
> @Fabian: I hope that this is the complete list, correct me if I am wrong. :)
>
> I am opening a small PR with the changes on top of Aljoscha's one that exposes the streaming partitioning then.
