Hi folks,
Two weeks ago, I started a thread [1] discussing whether we should discard the split/select methods (which have been marked as deprecated since v1.7) in the DataStream API.

The fact is, these methods produce "unexpected" results when used consecutively (e.g., ds.split(a).select(b).split(c).select(d)) or multiple times on the same target (e.g., ds.split(a).select(b), ds.split(c).select(d)). The reason is that, following the initial design, the new split/select logic always overrides the existing one on the same target operator rather than appending to it. Some users may not be aware of that, but if you are, a current workaround is to use the more powerful side output feature [2].

FLINK-11084 <https://issues.apache.org/jira/browse/FLINK-11084> added some restrictions to the existing split/select logic and suggests replacing it with side output in the future. However, considering that side output is currently only available in the process function layer and that split/select may be widely used in many real-world applications, we'd like to start a vote and listen to the community on how to deal with them.

In the discussion thread [1], we proposed the three solutions below. All of them are feasible but have different impacts on the public API.

1) Port the side output feature to the DataStream API's flatMap and replace split/select with it.

2) Introduce a dedicated function in the DataStream API (with the "correct" behavior but a different name) that can be used to replace the existing split/select.

3) Keep split/select but change the behavior/semantics to be "correct".

Note that this is just a vote for gathering information, so feel free to participate and share your opinions.

Voting will end on *July 7th 17:00 EDT*.

Thanks,
Xingcan

[1] https://lists.apache.org/thread.html/f94ea5c97f96c705527dcc809b0e2b69e87a4c5d400cb7c61859e1f4@%3Cdev.flink.apache.org%3E
[2] https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/side_output.html
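To make the override-versus-append distinction described above concrete, here is a minimal sketch in plain Java. It is only a toy model of the two semantics and does not use or reproduce any Flink classes or internals; the class name SplitSelectModel, its methods, and the two example selections (even numbers, then numbers greater than 5) are all hypothetical and chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Toy model only: plain Java, no Flink classes. All names are hypothetical.
public class SplitSelectModel {

    // "Override" semantics: only the most recently registered selection
    // is applied; earlier selections on the same target are ignored.
    public static List<Integer> applyOverride(List<Integer> input,
                                              List<Predicate<Integer>> selections) {
        Predicate<Integer> last = selections.get(selections.size() - 1);
        List<Integer> out = new ArrayList<>();
        for (Integer v : input) {
            if (last.test(v)) {
                out.add(v);
            }
        }
        return out;
    }

    // "Append" semantics: a record is kept only if it passes every
    // selection, i.e. each selection narrows the result of the previous one.
    public static List<Integer> applyAppend(List<Integer> input,
                                            List<Predicate<Integer>> selections) {
        List<Integer> out = new ArrayList<>();
        for (Integer v : input) {
            boolean keep = true;
            for (Predicate<Integer> p : selections) {
                if (!p.test(v)) {
                    keep = false;
                    break;
                }
            }
            if (keep) {
                out.add(v);
            }
        }
        return out;
    }

    // Example input: the integers 1 through 10.
    static List<Integer> numbers() {
        List<Integer> n = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            n.add(i);
        }
        return n;
    }

    // Two consecutive selections: first "even", then "greater than 5".
    static List<Predicate<Integer>> selections() {
        List<Predicate<Integer>> s = new ArrayList<>();
        s.add(v -> v % 2 == 0);
        s.add(v -> v > 5);
        return s;
    }

    public static List<Integer> demoOverride() {
        return applyOverride(numbers(), selections());
    }

    public static List<Integer> demoAppend() {
        return applyAppend(numbers(), selections());
    }

    public static void main(String[] args) {
        System.out.println("override: " + demoOverride()); // [6, 7, 8, 9, 10]
        System.out.println("append:   " + demoAppend());   // [6, 8, 10]
    }
}
```

In this model, a user who expects the second selection to refine the first would expect the "append" result, while the "override" result silently drops the first condition. That gap is the kind of surprise the three options try to address.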
Personally, I prefer 3): keep split/select and correct the behavior. I feel side output is kind of overkill for such a primitive function, and I prefer simple APIs like split/select.

Hao Sun

On Thu, Jul 4, 2019 at 11:20 AM Xingcan Cui <[hidden email]> wrote:
> [...]
Hi all,
Thanks for your participation.

In this thread, we got one +1 each for option 1 and option 3. In the original thread [1], we got two +1 for option 1, one +1 for option 2, and five +1 and one -1 for option 3.

To summarize,

Option 1 (port side output to flatMap and deprecate split/select): three +1
Option 2 (introduce a new split/select and deprecate the existing one): one +1
Option 3 ("correct" the existing split/select): six +1 and one -1

It seems that most people involved are in favor of "correcting" the existing split/select. However, this will definitely break API compatibility, in a subtle way.

IMO, the real behavior of consecutive split/select calls has never been thoroughly clarified. Even within the community, it is hard to say that we have reached a consensus on its real semantics [2-4]. Though the initial design is not ambiguous, there's no doubt that its concept has drifted.

As split/select is quite an ancient API, I cc'ed this to more members. It would be great if you could share your opinions on this.

Thanks,
Xingcan

[1] https://lists.apache.org/thread.html/f94ea5c97f96c705527dcc809b0e2b69e87a4c5d400cb7c61859e1f4@%3Cdev.flink.apache.org%3E
[2] https://issues.apache.org/jira/browse/FLINK-1772
[3] https://issues.apache.org/jira/browse/FLINK-5031
[4] https://issues.apache.org/jira/browse/FLINK-11084

> On Jul 5, 2019, at 12:04 AM, 杨力 <[hidden email]> wrote:
>
> I prefer the 1) approach. I used to carry fields, which are needed only for splitting, in the outputs of flatMap functions. Replacing them with outputTags would simplify data structures.
>
> Xingcan Cui <[hidden email]> wrote on Fri, Jul 5, 2019 at 2:20 AM:
> > [...]
I think this would benefit from a FLIP that neatly sums up the options and gives us a point where we can vote and ratify a decision.

As a gut feeling, I like Option 3) most. Initially I would have preferred option 1) (because of a sense of API purity), but by now I think it's good that users have this simpler option.

Aljoscha

> On 8. Jul 2019, at 06:39, Xingcan Cui <[hidden email]> wrote:
> [...]
Hi Aljoscha,
Thanks for your response. With all this preliminary information collected, I'll start a formal process.

Thanks, everybody, for your attention.

Best,
Xingcan

> On Jul 8, 2019, at 10:17 AM, Aljoscha Krettek <[hidden email]> wrote:
> [...]