About Deprecating split/select for DataStream API

classic Classic list List threaded Threaded
13 messages Options
Reply | Threaded
Open this post in threaded view
|

About Deprecating split/select for DataStream API

xingcanc
Hi all,

Recently, I noticed that the split/select methods in DataStream API have been marked as deprecated since 1.7.2 and 1.8.0 (the related JIRA issue FLINK-11084 <https://issues.apache.org/jira/browse/FLINK-11084>).

Although the two methods can be replaced by the more powerful side output feature[1], I still doubt whether we should really remove them in the future.

1. From semantics, the split/select is the reverse operation to the union transformation. Without them, the DataStream API seems to be missing a piece.

2. From accessibility, the side output only works for process functions, which means it forces the user to dive into a lower API.

According to FLINK-11084 <https://issues.apache.org/jira/browse/FLINK-11084>, there exist some problems with the current implementation of the two methods. Maybe we should fix the problems and re-active them again. Or if they really need to be deprecated, we should at least mark the corresponding documentation for that : )

What do you think?

Best,
Xingcan

[1] https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/side_output.html <https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/side_output.html>
Reply | Threaded
Open this post in threaded view
|

Re: About Deprecating split/select for DataStream API

SHI Xiaogang
Hi Xingcan,

Thanks for bringing it up for discusson.

I agree with you that we should not deprecate the split/select methods.
Their semantics are very clear and they are widely adopted by Flink users.
We should fix these problems instead of simply deprecating the methods.

Regards,
Xiaogang

Xingcan Cui <[hidden email]> 于2019年6月15日周六 下午4:13写道:

> Hi all,
>
> Recently, I noticed that the split/select methods in DataStream API have
> been marked as deprecated since 1.7.2 and 1.8.0 (the related JIRA issue
> FLINK-11084 <https://issues.apache.org/jira/browse/FLINK-11084>).
>
> Although the two methods can be replaced by the more powerful side output
> feature[1], I still doubt whether we should really remove them in the
> future.
>
> 1. From semantics, the split/select is the reverse operation to the union
> transformation. Without them, the DataStream API seems to be missing a
> piece.
>
> 2. From accessibility, the side output only works for process functions,
> which means it forces the user to dive into a lower API.
>
> According to FLINK-11084 <
> https://issues.apache.org/jira/browse/FLINK-11084>, there exist some
> problems with the current implementation of the two methods. Maybe we
> should fix the problems and re-active them again. Or if they really need to
> be deprecated, we should at least mark the corresponding documentation for
> that : )
>
> What do you think?
>
> Best,
> Xingcan
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/side_output.html
> <
> https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/side_output.html
> >
Reply | Threaded
Open this post in threaded view
|

Re: About Deprecating split/select for DataStream API

vino yang
Hi,

I also think it is valuable and reasonable to keep the split/select APIs.
They are very convenient and widely used in our platform. I think they are
also used in other users' jobs.
If the community has doubts about this, IMHO, it would be better to start a
user survey.

Best,
Vino

SHI Xiaogang <[hidden email]> 于2019年6月17日周一 上午11:55写道:

> Hi Xingcan,
>
> Thanks for bringing it up for discusson.
>
> I agree with you that we should not deprecate the split/select methods.
> Their semantics are very clear and they are widely adopted by Flink users.
> We should fix these problems instead of simply deprecating the methods.
>
> Regards,
> Xiaogang
>
> Xingcan Cui <[hidden email]> 于2019年6月15日周六 下午4:13写道:
>
> > Hi all,
> >
> > Recently, I noticed that the split/select methods in DataStream API have
> > been marked as deprecated since 1.7.2 and 1.8.0 (the related JIRA issue
> > FLINK-11084 <https://issues.apache.org/jira/browse/FLINK-11084>).
> >
> > Although the two methods can be replaced by the more powerful side output
> > feature[1], I still doubt whether we should really remove them in the
> > future.
> >
> > 1. From semantics, the split/select is the reverse operation to the union
> > transformation. Without them, the DataStream API seems to be missing a
> > piece.
> >
> > 2. From accessibility, the side output only works for process functions,
> > which means it forces the user to dive into a lower API.
> >
> > According to FLINK-11084 <
> > https://issues.apache.org/jira/browse/FLINK-11084>, there exist some
> > problems with the current implementation of the two methods. Maybe we
> > should fix the problems and re-active them again. Or if they really need
> to
> > be deprecated, we should at least mark the corresponding documentation
> for
> > that : )
> >
> > What do you think?
> >
> > Best,
> > Xingcan
> >
> > [1]
> >
> https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/side_output.html
> > <
> >
> https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/side_output.html
> > >
>
Reply | Threaded
Open this post in threaded view
|

Re: About Deprecating split/select for DataStream API

Jark Wu-2
+1 to keep the split/select API. I think if there are some problems with
the API, it's better to fix them instead of deprecating them.
And select/split are straightforward and convenient APIs. It's worth to
have them.

Regards,
Jark

On Mon, 17 Jun 2019 at 14:46, vino yang <[hidden email]> wrote:

> Hi,
>
> I also think it is valuable and reasonable to keep the split/select APIs.
> They are very convenient and widely used in our platform. I think they are
> also used in other users' jobs.
> If the community has doubts about this, IMHO, it would be better to start a
> user survey.
>
> Best,
> Vino
>
> SHI Xiaogang <[hidden email]> 于2019年6月17日周一 上午11:55写道:
>
> > Hi Xingcan,
> >
> > Thanks for bringing it up for discusson.
> >
> > I agree with you that we should not deprecate the split/select methods.
> > Their semantics are very clear and they are widely adopted by Flink
> users.
> > We should fix these problems instead of simply deprecating the methods.
> >
> > Regards,
> > Xiaogang
> >
> > Xingcan Cui <[hidden email]> 于2019年6月15日周六 下午4:13写道:
> >
> > > Hi all,
> > >
> > > Recently, I noticed that the split/select methods in DataStream API
> have
> > > been marked as deprecated since 1.7.2 and 1.8.0 (the related JIRA issue
> > > FLINK-11084 <https://issues.apache.org/jira/browse/FLINK-11084>).
> > >
> > > Although the two methods can be replaced by the more powerful side
> output
> > > feature[1], I still doubt whether we should really remove them in the
> > > future.
> > >
> > > 1. From semantics, the split/select is the reverse operation to the
> union
> > > transformation. Without them, the DataStream API seems to be missing a
> > > piece.
> > >
> > > 2. From accessibility, the side output only works for process
> functions,
> > > which means it forces the user to dive into a lower API.
> > >
> > > According to FLINK-11084 <
> > > https://issues.apache.org/jira/browse/FLINK-11084>, there exist some
> > > problems with the current implementation of the two methods. Maybe we
> > > should fix the problems and re-active them again. Or if they really
> need
> > to
> > > be deprecated, we should at least mark the corresponding documentation
> > for
> > > that : )
> > >
> > > What do you think?
> > >
> > > Best,
> > > Xingcan
> > >
> > > [1]
> > >
> >
> https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/side_output.html
> > > <
> > >
> >
> https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/side_output.html
> > > >
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: About Deprecating split/select for DataStream API

dwysakowicz

Hi all,

Thank you for starting the discussion. To start with I have to say I am not entirely against leaving them. On the other hand I totally disagree that the semantics are clearly defined. Actually the design is fundamentally flawed.

  1. We use String as a selector for elements. This is not the cleanest design, but I agree it is not the worst.
  2. Users cannot define different types for different splits.
  3. (The actual reason why I think it's actually better to drop the split/select and introduce a better mechanism) The behavior of a split is to actually add an output selector. We can have just a single selector on a single operator, but the API allows (I would even say encourages) to create chains of split/select, which leads to undefined behavior. Take this for example: ds.split().select("a", "b").select("c", "d"). Which tags should be forwarded? ("a", "b", "c", "d") (union) or () (intersection). In my opinion the most obvious answer in this case would be the intersection. Let's modify it slightly though and I would assume a different behavior (the union)

           splitted = ds.split();

           splitted.select("a", "b").map()

           splitted.select("c", "d").map()

Taking the 3rd argument into consideration I would be in favor of removing the current mechanism. I think the side outputs serve the purpose much better with much cleaner semantics. I get the argument that users are now forced to use processFunction if they want to use the side outputs. If this is the main problem how about enabling them e.g. for flatMap as well?

Best,

Dawid

On 17/06/2019 08:51, Jark Wu wrote:
+1 to keep the split/select API. I think if there are some problems with
the API, it's better to fix them instead of deprecating them.
And select/split are straightforward and convenient APIs. It's worth to
have them.

Regards,
Jark

On Mon, 17 Jun 2019 at 14:46, vino yang [hidden email] wrote:

Hi,

I also think it is valuable and reasonable to keep the split/select APIs.
They are very convenient and widely used in our platform. I think they are
also used in other users' jobs.
If the community has doubts about this, IMHO, it would be better to start a
user survey.

Best,
Vino

SHI Xiaogang [hidden email] 于2019年6月17日周一 上午11:55写道:

Hi Xingcan,

Thanks for bringing it up for discusson.

I agree with you that we should not deprecate the split/select methods.
Their semantics are very clear and they are widely adopted by Flink
users.
We should fix these problems instead of simply deprecating the methods.

Regards,
Xiaogang

Xingcan Cui [hidden email] 于2019年6月15日周六 下午4:13写道:

Hi all,

Recently, I noticed that the split/select methods in DataStream API
have
been marked as deprecated since 1.7.2 and 1.8.0 (the related JIRA issue
FLINK-11084 <https://issues.apache.org/jira/browse/FLINK-11084>).

Although the two methods can be replaced by the more powerful side
output
feature[1], I still doubt whether we should really remove them in the
future.

1. From semantics, the split/select is the reverse operation to the
union
transformation. Without them, the DataStream API seems to be missing a
piece.

2. From accessibility, the side output only works for process
functions,
which means it forces the user to dive into a lower API.

According to FLINK-11084 <
https://issues.apache.org/jira/browse/FLINK-11084>, there exist some
problems with the current implementation of the two methods. Maybe we
should fix the problems and re-active them again. Or if they really
need
to
be deprecated, we should at least mark the corresponding documentation
for
that : )

What do you think?

Best,
Xingcan

[1]


        
https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/side_output.html
<


        
https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/side_output.html

            

        

      

    

signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: About Deprecating split/select for DataStream API

SHI Xiaogang
Hi Dawid,

As the select method is only allowed on SplitStreams, it's impossible to
construct the example ds.split().select("a", "b").select("c", "d").

Are you meaning ds.split().select("a", "b").split().select("c", "d")?
If so, then the tagging in the first split operation should not affect the
second one. Then
    splitted.select("a", "b") => empty
    splitted.select("c", "d") => ds

I cannot quite catch your point here. It's appreciated if you can provide a
more concrete explanation?

Regards,
Xiaogang Shi




Dawid Wysakowicz <[hidden email]> 于2019年6月17日周一 下午3:10写道:

> Hi all,
>
> Thank you for starting the discussion. To start with I have to say I am
> not entirely against leaving them. On the other hand I totally disagree
> that the semantics are clearly defined. Actually the design is
> fundamentally flawed.
>
>    1. We use String as a selector for elements. This is not the cleanest
>    design, but I agree it is not the worst.
>    2. Users cannot define different types for different splits.
>    3. (The actual reason why I think it's actually better to drop the
>    split/select and introduce a better mechanism) The behavior of a split is
>    to actually add an output selector. We can have just a single selector on a
>    single operator, but the API allows (I would even say encourages) to create
>    chains of split/select, which leads to undefined behavior. Take this for
>    example: ds.split().select("a", "b").select("c", "d"). Which tags should be
>    forwarded? ("a", "b", "c", "d") (union) or () (intersection). In my opinion
>    the most obvious answer in this case would be the intersection. Let's
>    modify it slightly though and I would assume a different behavior (the
>    union)
>
>            splitted = ds.split();
>
>            splitted.select("a", "b").map()
>
>            splitted.select("c", "d").map()
>
> Taking the 3rd argument into consideration I would be in favor of removing
> the current mechanism. I think the side outputs serve the purpose much
> better with much cleaner semantics. I get the argument that users are now
> forced to use processFunction if they want to use the side outputs. If this
> is the main problem how about enabling them e.g. for flatMap as well?
>
> Best,
>
> Dawid
> On 17/06/2019 08:51, Jark Wu wrote:
>
> +1 to keep the split/select API. I think if there are some problems with
> the API, it's better to fix them instead of deprecating them.
> And select/split are straightforward and convenient APIs. It's worth to
> have them.
>
> Regards,
> Jark
>
> On Mon, 17 Jun 2019 at 14:46, vino yang <[hidden email]> <[hidden email]> wrote:
>
>
> Hi,
>
> I also think it is valuable and reasonable to keep the split/select APIs.
> They are very convenient and widely used in our platform. I think they are
> also used in other users' jobs.
> If the community has doubts about this, IMHO, it would be better to start a
> user survey.
>
> Best,
> Vino
>
> SHI Xiaogang <[hidden email]> <[hidden email]> 于2019年6月17日周一 上午11:55写道:
>
>
> Hi Xingcan,
>
> Thanks for bringing it up for discusson.
>
> I agree with you that we should not deprecate the split/select methods.
> Their semantics are very clear and they are widely adopted by Flink
>
> users.
>
> We should fix these problems instead of simply deprecating the methods.
>
> Regards,
> Xiaogang
>
> Xingcan Cui <[hidden email]> <[hidden email]> 于2019年6月15日周六 下午4:13写道:
>
>
> Hi all,
>
> Recently, I noticed that the split/select methods in DataStream API
>
> have
>
> been marked as deprecated since 1.7.2 and 1.8.0 (the related JIRA issue
> FLINK-11084 <https://issues.apache.org/jira/browse/FLINK-11084> <https://issues.apache.org/jira/browse/FLINK-11084>).
>
> Although the two methods can be replaced by the more powerful side
>
> output
>
> feature[1], I still doubt whether we should really remove them in the
> future.
>
> 1. From semantics, the split/select is the reverse operation to the
>
> union
>
> transformation. Without them, the DataStream API seems to be missing a
> piece.
>
> 2. From accessibility, the side output only works for process
>
> functions,
>
> which means it forces the user to dive into a lower API.
>
> According to FLINK-11084 <
> https://issues.apache.org/jira/browse/FLINK-11084> <https://issues.apache.org/jira/browse/FLINK-11084>, there exist some
> problems with the current implementation of the two methods. Maybe we
> should fix the problems and re-active them again. Or if they really
>
> need
>
> to
>
> be deprecated, we should at least mark the corresponding documentation
>
> for
>
> that : )
>
> What do you think?
>
> Best,
> Xingcan
>
> [1]
>
>
> https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/side_output.html
>
> <
>
>
> https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/side_output.html
>
>
Reply | Threaded
Open this post in threaded view
|

Re: About Deprecating split/select for DataStream API

dwysakowicz
Yes you are correct. The problem I described applies to the split not
select as I wrote in the first email. Sorry for that.

I will try to prepare a correct example. Let's have a look at this example:

    val splitted1 = ds.split(if (1) then "a")

    val splitted2 = ds.split(if (!=1) then "a")

In those cases splitted1.select("a") -> will output all elements, the
same for splitted2, because the OutputSelector(s) are applied to
previous operator. The behavior I would assume is that splitted1 outputs
only "1"s, whereas splitted2 all but "1"s

On the other hand in a call

    val splitted1 = ds.split(if ("1" or "2") then
"a").select("a").split(if ("3") then "b").select("b")

I would assume an intersection of those two splits, so no results. What
actually happens is that it will be "1", "2" & "3"s. Actually, right
exceptions should be thrown in those cases not to produce confusing
results, but this just shows that this API is broken, if we need to
check for some prohibited configurations during runtime.

Those weird behaviors are in my opinion results of the flawed API, as it
actually assigns an output selector to the previous operator. In other
words it modifies previous operator. I think it would be much cleaner if
this happened inside an operator rather than separately. This is what
SideOutputs do, as you define them inside the ProcessFunction, rather
than afterwards. Therefore I am very much in favor of using them for
those cases. Once again if the problem is that they are available only
in the ProcessFunction I would prefer enabling them e.g. in FlatMap,
rather than keeping the split/select.

   

On 17/06/2019 09:40, SHI Xiaogang wrote:

> Hi Dawid,
>
> As the select method is only allowed on SplitStreams, it's impossible to
> construct the example ds.split().select("a", "b").select("c", "d").
>
> Are you meaning ds.split().select("a", "b").split().select("c", "d")?
> If so, then the tagging in the first split operation should not affect the
> second one. Then
>     splitted.select("a", "b") => empty
>     splitted.select("c", "d") => ds
>
> I cannot quite catch your point here. It's appreciated if you can provide a
> more concrete explanation?
>
> Regards,
> Xiaogang Shi
>
>
>
>
> Dawid Wysakowicz <[hidden email]> 于2019年6月17日周一 下午3:10写道:
>
>> Hi all,
>>
>> Thank you for starting the discussion. To start with I have to say I am
>> not entirely against leaving them. On the other hand I totally disagree
>> that the semantics are clearly defined. Actually the design is
>> fundamentally flawed.
>>
>>    1. We use String as a selector for elements. This is not the cleanest
>>    design, but I agree it is not the worst.
>>    2. Users cannot define different types for different splits.
>>    3. (The actual reason why I think it's actually better to drop the
>>    split/select and introduce a better mechanism) The behavior of a split is
>>    to actually add an output selector. We can have just a single selector on a
>>    single operator, but the API allows (I would even say encourages) to create
>>    chains of split/select, which leads to undefined behavior. Take this for
>>    example: ds.split().select("a", "b").select("c", "d"). Which tags should be
>>    forwarded? ("a", "b", "c", "d") (union) or () (intersection). In my opinion
>>    the most obvious answer in this case would be the intersection. Let's
>>    modify it slightly though and I would assume a different behavior (the
>>    union)
>>
>>            splitted = ds.split();
>>
>>            splitted.select("a", "b").map()
>>
>>            splitted.select("c", "d").map()
>>
>> Taking the 3rd argument into consideration I would be in favor of removing
>> the current mechanism. I think the side outputs serve the purpose much
>> better with much cleaner semantics. I get the argument that users are now
>> forced to use processFunction if they want to use the side outputs. If this
>> is the main problem how about enabling them e.g. for flatMap as well?
>>
>> Best,
>>
>> Dawid
>> On 17/06/2019 08:51, Jark Wu wrote:
>>
>> +1 to keep the split/select API. I think if there are some problems with
>> the API, it's better to fix them instead of deprecating them.
>> And select/split are straightforward and convenient APIs. It's worth to
>> have them.
>>
>> Regards,
>> Jark
>>
>> On Mon, 17 Jun 2019 at 14:46, vino yang <[hidden email]> <[hidden email]> wrote:
>>
>>
>> Hi,
>>
>> I also think it is valuable and reasonable to keep the split/select APIs.
>> They are very convenient and widely used in our platform. I think they are
>> also used in other users' jobs.
>> If the community has doubts about this, IMHO, it would be better to start a
>> user survey.
>>
>> Best,
>> Vino
>>
>> SHI Xiaogang <[hidden email]> <[hidden email]> 于2019年6月17日周一 上午11:55写道:
>>
>>
>> Hi Xingcan,
>>
>> Thanks for bringing it up for discusson.
>>
>> I agree with you that we should not deprecate the split/select methods.
>> Their semantics are very clear and they are widely adopted by Flink
>>
>> users.
>>
>> We should fix these problems instead of simply deprecating the methods.
>>
>> Regards,
>> Xiaogang
>>
>> Xingcan Cui <[hidden email]> <[hidden email]> 于2019年6月15日周六 下午4:13写道:
>>
>>
>> Hi all,
>>
>> Recently, I noticed that the split/select methods in DataStream API
>>
>> have
>>
>> been marked as deprecated since 1.7.2 and 1.8.0 (the related JIRA issue
>> FLINK-11084 <https://issues.apache.org/jira/browse/FLINK-11084> <https://issues.apache.org/jira/browse/FLINK-11084>).
>>
>> Although the two methods can be replaced by the more powerful side
>>
>> output
>>
>> feature[1], I still doubt whether we should really remove them in the
>> future.
>>
>> 1. From semantics, the split/select is the reverse operation to the
>>
>> union
>>
>> transformation. Without them, the DataStream API seems to be missing a
>> piece.
>>
>> 2. From accessibility, the side output only works for process
>>
>> functions,
>>
>> which means it forces the user to dive into a lower API.
>>
>> According to FLINK-11084 <
>> https://issues.apache.org/jira/browse/FLINK-11084> <https://issues.apache.org/jira/browse/FLINK-11084>, there exist some
>> problems with the current implementation of the two methods. Maybe we
>> should fix the problems and re-active them again. Or if they really
>>
>> need
>>
>> to
>>
>> be deprecated, we should at least mark the corresponding documentation
>>
>> for
>>
>> that : )
>>
>> What do you think?
>>
>> Best,
>> Xingcan
>>
>> [1]
>>
>>
>> https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/side_output.html
>>
>> <
>>
>>
>> https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/side_output.html
>>
>>


signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: About Deprecating split/select for DataStream API

SHI Xiaogang
Hi Dawid,

Thanks a lot for your example.

I think most users will expect splitted1 to be empty in the example.

The unexpected results produced, in my opinion, is due to our problematic
implementation, instead of the confusing semantics.
We can fix the problem if we add a SELECT operator to filter out unexpected
records (Of course, we can find some optimization to improve the
efficiency.).

After all, i prefer to fix the problems to make the results as expected.
What do you think?

Regards,
Xiaogang

Dawid Wysakowicz <[hidden email]> 于2019年6月17日周一 下午7:21写道:

> Yes you are correct. The problem I described applies to the split not
> select as I wrote in the first email. Sorry for that.
>
> I will try to prepare a correct example. Let's have a look at this example:
>
>     val splitted1 = ds.split(if (1) then "a")
>
>     val splitted2 = ds.split(if (!=1) then "a")
>
> In those cases splitted1.select("a") -> will output all elements, the
> same for splitted2, because the OutputSelector(s) are applied to
> previous operator. The behavior I would assume is that splitted1 outputs
> only "1"s, whereas splitted2 all but "1"s
>
> On the other hand in a call
>
>     val splitted1 = ds.split(if ("1" or "2") then
> "a").select("a").split(if ("3") then "b").select("b")
>
> I would assume an intersection of those two splits, so no results. What
> actually happens is that it will be "1", "2" & "3"s. Actually, right
> exceptions should be thrown in those cases not to produce confusing
> results, but this just shows that this API is broken, if we need to
> check for some prohibited configurations during runtime.
>
> Those weird behaviors are in my opinion results of the flawed API, as it
> actually assigns an output selector to the previous operator. In other
> words it modifies previous operator. I think it would be much cleaner if
> this happened inside an operator rather than separately. This is what
> SideOutputs do, as you define them inside the ProcessFunction, rather
> than afterwards. Therefore I am very much in favor of using them for
> those cases. Once again if the problem is that they are available only
> in the ProcessFunction I would prefer enabling them e.g. in FlatMap,
> rather than keeping the split/select.
>
>
>
> On 17/06/2019 09:40, SHI Xiaogang wrote:
> > Hi Dawid,
> >
> > As the select method is only allowed on SplitStreams, it's impossible to
> > construct the example ds.split().select("a", "b").select("c", "d").
> >
> > Are you meaning ds.split().select("a", "b").split().select("c", "d")?
> > If so, then the tagging in the first split operation should not affect
> the
> > second one. Then
> >     splitted.select("a", "b") => empty
> >     splitted.select("c", "d") => ds
> >
> > I cannot quite catch your point here. It's appreciated if you can
> provide a
> > more concrete explanation?
> >
> > Regards,
> > Xiaogang Shi
> >
> >
> >
> >
> > Dawid Wysakowicz <[hidden email]> 于2019年6月17日周一 下午3:10写道:
> >
> >> Hi all,
> >>
> >> Thank you for starting the discussion. To start with I have to say I am
> >> not entirely against leaving them. On the other hand I totally disagree
> >> that the semantics are clearly defined. Actually the design is
> >> fundamentally flawed.
> >>
> >>    1. We use String as a selector for elements. This is not the cleanest
> >>    design, but I agree it is not the worst.
> >>    2. Users cannot define different types for different splits.
> >>    3. (The actual reason why I think it's actually better to drop the
> >>    split/select and introduce a better mechanism) The behavior of a
> split is
> >>    to actually add an output selector. We can have just a single
> selector on a
> >>    single operator, but the API allows (I would even say encourages) to
> create
> >>    chains of split/select, which leads to undefined behavior. Take this
> for
> >>    example: ds.split().select("a", "b").select("c", "d"). Which tags
> should be
> >>    forwarded? ("a", "b", "c", "d") (union) or () (intersection). In my
> opinion
> >>    the most obvious answer in this case would be the intersection. Let's
> >>    modify it slightly though and I would assume a different behavior
> (the
> >>    union)
> >>
> >>            splitted = ds.split();
> >>
> >>            splitted.select("a", "b").map()
> >>
> >>            splitted.select("c", "d").map()
> >>
> >> Taking the 3rd argument into consideration I would be in favor of
> removing
> >> the current mechanism. I think the side outputs serve the purpose much
> >> better with much cleaner semantics. I get the argument that users are
> now
> >> forced to use processFunction if they want to use the side outputs. If
> this
> >> is the main problem how about enabling them e.g. for flatMap as well?
> >>
> >> Best,
> >>
> >> Dawid
> >> On 17/06/2019 08:51, Jark Wu wrote:
> >>
> >> +1 to keep the split/select API. I think if there are some problems with
> >> the API, it's better to fix them instead of deprecating them.
> >> And select/split are straightforward and convenient APIs. It's worth to
> >> have them.
> >>
> >> Regards,
> >> Jark
> >>
> >> On Mon, 17 Jun 2019 at 14:46, vino yang <[hidden email]> <
> [hidden email]> wrote:
> >>
> >>
> >> Hi,
> >>
> >> I also think it is valuable and reasonable to keep the split/select
> APIs.
> >> They are very convenient and widely used in our platform. I think they
> are
> >> also used in other users' jobs.
> >> If the community has doubts about this, IMHO, it would be better to
> start a
> >> user survey.
> >>
> >> Best,
> >> Vino
> >>
> >> SHI Xiaogang <[hidden email]> <[hidden email]>
> 于2019年6月17日周一 上午11:55写道:
> >>
> >>
> >> Hi Xingcan,
> >>
> >> Thanks for bringing it up for discusson.
> >>
> >> I agree with you that we should not deprecate the split/select methods.
> >> Their semantics are very clear and they are widely adopted by Flink
> >>
> >> users.
> >>
> >> We should fix these problems instead of simply deprecating the methods.
> >>
> >> Regards,
> >> Xiaogang
> >>
> >> Xingcan Cui <[hidden email]> <[hidden email]> 于2019年6月15日周六
> 下午4:13写道:
> >>
> >>
> >> Hi all,
> >>
> >> Recently, I noticed that the split/select methods in DataStream API
> >>
> >> have
> >>
> >> been marked as deprecated since 1.7.2 and 1.8.0 (the related JIRA issue
> >> FLINK-11084 <https://issues.apache.org/jira/browse/FLINK-11084> <
> https://issues.apache.org/jira/browse/FLINK-11084>).
> >>
> >> Although the two methods can be replaced by the more powerful side
> >>
> >> output
> >>
> >> feature[1], I still doubt whether we should really remove them in the
> >> future.
> >>
> >> 1. From semantics, the split/select is the reverse operation to the
> >>
> >> union
> >>
> >> transformation. Without them, the DataStream API seems to be missing a
> >> piece.
> >>
> >> 2. From accessibility, the side output only works for process
> >>
> >> functions,
> >>
> >> which means it forces the user to dive into a lower API.
> >>
> >> According to FLINK-11084 <
> >> https://issues.apache.org/jira/browse/FLINK-11084> <
> https://issues.apache.org/jira/browse/FLINK-11084>, there exist some
> >> problems with the current implementation of the two methods. Maybe we
> >> should fix the problems and re-active them again. Or if they really
> >>
> >> need
> >>
> >> to
> >>
> >> be deprecated, we should at least mark the corresponding documentation
> >>
> >> for
> >>
> >> that : )
> >>
> >> What do you think?
> >>
> >> Best,
> >> Xingcan
> >>
> >> [1]
> >>
> >>
> >>
> https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/side_output.html
> >>
> >> <
> >>
> >>
> >>
> https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/side_output.html
> >>
> >>
>
>
Reply | Threaded
Open this post in threaded view
|

RE: About Deprecating split/select for DataStream API

xingcanc
Hi all,

Thanks for sharing your thoughts on this topic.

First, we must admit that the current implementation for split/select is flawed. I roughly went through the source codes, the problem may be that for consecutive select/split(s), the former one will be overridden by the later one during StreamGraph generation phase. That's why we forbid this consecutive logic in FLINK-11084.

Now the question is whether we should guide users to migrate to the new side output feature or thoroughly rework the broken API with the correct semantics (instead of just trying to forbid all the "invalid" usages).

Personally, I prefer the later solution because

1. The split/select may have been widely used without touching the broken part.
2. Though restricted compared with side output, the semantics for split/select itself is acceptable since union does not support different data types either.
3. We need a complete and easy-to-use transformation set for DataStream API. Enabling side output for flatMap may not be an ultimate solution.

To summarize, maybe we should not easily deprecate the split/select public API. If we come to a consensus on that, how about rewriting it based on side output? (like the implementation for join on coGroup)

Any feedback is welcome : )

Best,
Xingcan

-----Original Message-----
From: SHI Xiaogang <[hidden email]>
Sent: Monday, June 17, 2019 8:08 AM
To: Dawid Wysakowicz <[hidden email]>
Cc: [hidden email]
Subject: Re: About Deprecating split/select for DataStream API

Hi Dawid,

Thanks a lot for your example.

I think most users will expect splitted1 to be empty in the example.

The unexpected results produced, in my opinion, is due to our problematic implementation, instead of the confusing semantics.
We can fix the problem if we add a SELECT operator to filter out unexpected records (Of course, we can find some optimization to improve the efficiency.).

After all, i prefer to fix the problems to make the results as expected.
What do you think?

Regards,
Xiaogang

Dawid Wysakowicz <[hidden email]> 于2019年6月17日周一 下午7:21写道:

> Yes you are correct. The problem I described applies to the split not
> select as I wrote in the first email. Sorry for that.
>
> I will try to prepare a correct example. Let's have a look at this example:
>
>     val splitted1 = ds.split(if (1) then "a")
>
>     val splitted2 = ds.split(if (!=1) then "a")
>
> In those cases splitted1.select("a") -> will output all elements, the
> same for splitted2, because the OutputSelector(s) are applied to
> previous operator. The behavior I would assume is that splitted1
> outputs only "1"s, whereas splitted2 all but "1"s
>
> On the other hand in a call
>
>     val splitted1 = ds.split(if ("1" or "2") then
> "a").select("a").split(if ("3") then "b").select("b")
>
> I would assume an intersection of those two splits, so no results.
> What actually happens is that it will be "1", "2" & "3"s. Actually,
> right exceptions should be thrown in those cases not to produce
> confusing results, but this just shows that this API is broken, if we
> need to check for some prohibited configurations during runtime.
>
> Those weird behaviors are in my opinion results of the flawed API, as
> it actually assigns an output selector to the previous operator. In
> other words it modifies previous operator. I think it would be much
> cleaner if this happened inside an operator rather than separately.
> This is what SideOutputs do, as you define them inside the
> ProcessFunction, rather than afterwards. Therefore I am very much in
> favor of using them for those cases. Once again if the problem is that
> they are available only in the ProcessFunction I would prefer enabling
> them e.g. in FlatMap, rather than keeping the split/select.
>
>
>
> On 17/06/2019 09:40, SHI Xiaogang wrote:
> > Hi Dawid,
> >
> > As the select method is only allowed on SplitStreams, it's
> > impossible to construct the example ds.split().select("a", "b").select("c", "d").
> >
> > Are you meaning ds.split().select("a", "b").split().select("c", "d")?
> > If so, then the tagging in the first split operation should not
> > affect
> the
> > second one. Then
> >     splitted.select("a", "b") => empty
> >     splitted.select("c", "d") => ds
> >
> > I cannot quite catch your point here. It's appreciated if you can
> provide a
> > more concrete explanation?
> >
> > Regards,
> > Xiaogang Shi
> >
> >
> >
> >
> > Dawid Wysakowicz <[hidden email]> 于2019年6月17日周一 下午3:10写道:
> >
> >> Hi all,
> >>
> >> Thank you for starting the discussion. To start with I have to say
> >> I am not entirely against leaving them. On the other hand I totally
> >> disagree that the semantics are clearly defined. Actually the
> >> design is fundamentally flawed.
> >>
> >>    1. We use String as a selector for elements. This is not the cleanest
> >>    design, but I agree it is not the worst.
> >>    2. Users cannot define different types for different splits.
> >>    3. (The actual reason why I think it's actually better to drop the
> >>    split/select and introduce a better mechanism) The behavior of a
> split is
> >>    to actually add an output selector. We can have just a single
> selector on a
> >>    single operator, but the API allows (I would even say
> >> encourages) to
> create
> >>    chains of split/select, which leads to undefined behavior. Take
> >> this
> for
> >>    example: ds.split().select("a", "b").select("c", "d"). Which
> >> tags
> should be
> >>    forwarded? ("a", "b", "c", "d") (union) or () (intersection). In
> >> my
> opinion
> >>    the most obvious answer in this case would be the intersection. Let's
> >>    modify it slightly though and I would assume a different
> >> behavior
> (the
> >>    union)
> >>
> >>            splitted = ds.split();
> >>
> >>            splitted.select("a", "b").map()
> >>
> >>            splitted.select("c", "d").map()
> >>
> >> Taking the 3rd argument into consideration I would be in favor of
> removing
> >> the current mechanism. I think the side outputs serve the purpose
> >> much better with much cleaner semantics. I get the argument that
> >> users are
> now
> >> forced to use processFunction if they want to use the side outputs.
> >> If
> this
> >> is the main problem how about enabling them e.g. for flatMap as well?
> >>
> >> Best,
> >>
> >> Dawid
> >> On 17/06/2019 08:51, Jark Wu wrote:
> >>
> >> +1 to keep the split/select API. I think if there are some problems
> >> +with
> >> the API, it's better to fix them instead of deprecating them.
> >> And select/split are straightforward and convenient APIs. It's
> >> worth to have them.
> >>
> >> Regards,
> >> Jark
> >>
> >> On Mon, 17 Jun 2019 at 14:46, vino yang <[hidden email]> <
> [hidden email]> wrote:
> >>
> >>
> >> Hi,
> >>
> >> I also think it is valuable and reasonable to keep the split/select
> APIs.
> >> They are very convenient and widely used in our platform. I think
> >> they
> are
> >> also used in other users' jobs.
> >> If the community has doubts about this, IMHO, it would be better to
> start a
> >> user survey.
> >>
> >> Best,
> >> Vino
> >>
> >> SHI Xiaogang <[hidden email]> <[hidden email]>
> 于2019年6月17日周一 上午11:55写道:
> >>
> >>
> >> Hi Xingcan,
> >>
> >> Thanks for bringing it up for discusson.
> >>
> >> I agree with you that we should not deprecate the split/select methods.
> >> Their semantics are very clear and they are widely adopted by Flink
> >>
> >> users.
> >>
> >> We should fix these problems instead of simply deprecating the methods.
> >>
> >> Regards,
> >> Xiaogang
> >>
> >> Xingcan Cui <[hidden email]> <[hidden email]> 于2019年6月15日周六
> 下午4:13写道:
> >>
> >>
> >> Hi all,
> >>
> >> Recently, I noticed that the split/select methods in DataStream API
> >>
> >> have
> >>
> >> been marked as deprecated since 1.7.2 and 1.8.0 (the related JIRA
> >> issue
> >> FLINK-11084 <https://issues.apache.org/jira/browse/FLINK-11084> <
> https://issues.apache.org/jira/browse/FLINK-11084>).
> >>
> >> Although the two methods can be replaced by the more powerful side
> >>
> >> output
> >>
> >> feature[1], I still doubt whether we should really remove them in
> >> the future.
> >>
> >> 1. From semantics, the split/select is the reverse operation to the
> >>
> >> union
> >>
> >> transformation. Without them, the DataStream API seems to be
> >> missing a piece.
> >>
> >> 2. From accessibility, the side output only works for process
> >>
> >> functions,
> >>
> >> which means it forces the user to dive into a lower API.
> >>
> >> According to FLINK-11084 <
> >> https://issues.apache.org/jira/browse/FLINK-11084> <
> https://issues.apache.org/jira/browse/FLINK-11084>, there exist some
> >> problems with the current implementation of the two methods. Maybe
> >> we should fix the problems and re-active them again. Or if they
> >> really
> >>
> >> need
> >>
> >> to
> >>
> >> be deprecated, we should at least mark the corresponding
> >> documentation
> >>
> >> for
> >>
> >> that : )
> >>
> >> What do you think?
> >>
> >> Best,
> >> Xingcan
> >>
> >> [1]
> >>
> >>
> >>
> https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/side
> _output.html
> >>
> >> <
> >>
> >>
> >>
> https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/side
> _output.html
> >>
> >>
>
>

Reply | Threaded
Open this post in threaded view
|

Re: About Deprecating split/select for DataStream API

Dian Fu-2
Hi all,

Thanks a lot for the discussion. I'm also in favor of rewriting/redesigning the split/select API instead of removing them. It has been a consensus that the side output API can achieve all the functionalities of the split/select API. The problem is whether we should also support some easy-to-use APIs on top of it. IMO, we should do that as long as the APIs have clear semantic and wide usage scenario. I think split/select API is such a kind of API.

Regards,
Dian

> 在 2019年6月18日,上午12:30,[hidden email] 写道:
>
> Hi all,
>
> Thanks for sharing your thoughts on this topic.
>
> First, we must admit that the current implementation for split/select is flawed. I roughly went through the source codes, the problem may be that for consecutive select/split(s), the former one will be overridden by the later one during StreamGraph generation phase. That's why we forbid this consecutive logic in FLINK-11084.
>
> Now the question is whether we should guide users to migrate to the new side output feature or thoroughly rework the broken API with the correct semantics (instead of just trying to forbid all the "invalid" usages).
>
> Personally, I prefer the later solution because
>
> 1. The split/select may have been widely used without touching the broken part.
> 2. Though restricted compared with side output, the semantics for split/select itself is acceptable since union does not support different data types either.
> 3. We need a complete and easy-to-use transformation set for DataStream API. Enabling side output for flatMap may not be an ultimate solution.
>
> To summarize, maybe we should not easily deprecate the split/select public API. If we come to a consensus on that, how about rewriting it based on side output? (like the implementation for join on coGroup)
>
> Any feedback is welcome : )
>
> Best,
> Xingcan
>
> -----Original Message-----
> From: SHI Xiaogang <[hidden email]>
> Sent: Monday, June 17, 2019 8:08 AM
> To: Dawid Wysakowicz <[hidden email]>
> Cc: [hidden email]
> Subject: Re: About Deprecating split/select for DataStream API
>
> Hi Dawid,
>
> Thanks a lot for your example.
>
> I think most users will expect splitted1 to be empty in the example.
>
> The unexpected results produced, in my opinion, is due to our problematic implementation, instead of the confusing semantics.
> We can fix the problem if we add a SELECT operator to filter out unexpected records (Of course, we can find some optimization to improve the efficiency.).
>
> After all, i prefer to fix the problems to make the results as expected.
> What do you think?
>
> Regards,
> Xiaogang
>
> Dawid Wysakowicz <[hidden email]> 于2019年6月17日周一 下午7:21写道:
>
>> Yes you are correct. The problem I described applies to the split not
>> select as I wrote in the first email. Sorry for that.
>>
>> I will try to prepare a correct example. Let's have a look at this example:
>>
>>    val splitted1 = ds.split(if (1) then "a")
>>
>>    val splitted2 = ds.split(if (!=1) then "a")
>>
>> In those cases splitted1.select("a") -> will output all elements, the
>> same for splitted2, because the OutputSelector(s) are applied to
>> previous operator. The behavior I would assume is that splitted1
>> outputs only "1"s, whereas splitted2 all but "1"s
>>
>> On the other hand in a call
>>
>>    val splitted1 = ds.split(if ("1" or "2") then
>> "a").select("a").split(if ("3") then "b").select("b")
>>
>> I would assume an intersection of those two splits, so no results.
>> What actually happens is that it will be "1", "2" & "3"s. Actually,
>> right exceptions should be thrown in those cases not to produce
>> confusing results, but this just shows that this API is broken, if we
>> need to check for some prohibited configurations during runtime.
>>
>> Those weird behaviors are in my opinion results of the flawed API, as
>> it actually assigns an output selector to the previous operator. In
>> other words it modifies previous operator. I think it would be much
>> cleaner if this happened inside an operator rather than separately.
>> This is what SideOutputs do, as you define them inside the
>> ProcessFunction, rather than afterwards. Therefore I am very much in
>> favor of using them for those cases. Once again if the problem is that
>> they are available only in the ProcessFunction I would prefer enabling
>> them e.g. in FlatMap, rather than keeping the split/select.
>>
>>
>>
>> On 17/06/2019 09:40, SHI Xiaogang wrote:
>>> Hi Dawid,
>>>
>>> As the select method is only allowed on SplitStreams, it's
>>> impossible to construct the example ds.split().select("a", "b").select("c", "d").
>>>
>>> Are you meaning ds.split().select("a", "b").split().select("c", "d")?
>>> If so, then the tagging in the first split operation should not
>>> affect
>> the
>>> second one. Then
>>>    splitted.select("a", "b") => empty
>>>    splitted.select("c", "d") => ds
>>>
>>> I cannot quite catch your point here. It's appreciated if you can
>> provide a
>>> more concrete explanation?
>>>
>>> Regards,
>>> Xiaogang Shi
>>>
>>>
>>>
>>>
>>> Dawid Wysakowicz <[hidden email]> 于2019年6月17日周一 下午3:10写道:
>>>
>>>> Hi all,
>>>>
>>>> Thank you for starting the discussion. To start with I have to say
>>>> I am not entirely against leaving them. On the other hand I totally
>>>> disagree that the semantics are clearly defined. Actually the
>>>> design is fundamentally flawed.
>>>>
>>>>   1. We use String as a selector for elements. This is not the cleanest
>>>>   design, but I agree it is not the worst.
>>>>   2. Users cannot define different types for different splits.
>>>>   3. (The actual reason why I think it's actually better to drop the
>>>>   split/select and introduce a better mechanism) The behavior of a
>> split is
>>>>   to actually add an output selector. We can have just a single
>> selector on a
>>>>   single operator, but the API allows (I would even say
>>>> encourages) to
>> create
>>>>   chains of split/select, which leads to undefined behavior. Take
>>>> this
>> for
>>>>   example: ds.split().select("a", "b").select("c", "d"). Which
>>>> tags
>> should be
>>>>   forwarded? ("a", "b", "c", "d") (union) or () (intersection). In
>>>> my
>> opinion
>>>>   the most obvious answer in this case would be the intersection. Let's
>>>>   modify it slightly though and I would assume a different
>>>> behavior
>> (the
>>>>   union)
>>>>
>>>>           splitted = ds.split();
>>>>
>>>>           splitted.select("a", "b").map()
>>>>
>>>>           splitted.select("c", "d").map()
>>>>
>>>> Taking the 3rd argument into consideration I would be in favor of
>> removing
>>>> the current mechanism. I think the side outputs serve the purpose
>>>> much better with much cleaner semantics. I get the argument that
>>>> users are
>> now
>>>> forced to use processFunction if they want to use the side outputs.
>>>> If
>> this
>>>> is the main problem how about enabling them e.g. for flatMap as well?
>>>>
>>>> Best,
>>>>
>>>> Dawid
>>>> On 17/06/2019 08:51, Jark Wu wrote:
>>>>
>>>> +1 to keep the split/select API. I think if there are some problems
>>>> +with
>>>> the API, it's better to fix them instead of deprecating them.
>>>> And select/split are straightforward and convenient APIs. It's
>>>> worth to have them.
>>>>
>>>> Regards,
>>>> Jark
>>>>
>>>> On Mon, 17 Jun 2019 at 14:46, vino yang <[hidden email]> <
>> [hidden email]> wrote:
>>>>
>>>>
>>>> Hi,
>>>>
>>>> I also think it is valuable and reasonable to keep the split/select
>> APIs.
>>>> They are very convenient and widely used in our platform. I think
>>>> they
>> are
>>>> also used in other users' jobs.
>>>> If the community has doubts about this, IMHO, it would be better to
>> start a
>>>> user survey.
>>>>
>>>> Best,
>>>> Vino
>>>>
>>>> SHI Xiaogang <[hidden email]> <[hidden email]>
>> 于2019年6月17日周一 上午11:55写道:
>>>>
>>>>
>>>> Hi Xingcan,
>>>>
>>>> Thanks for bringing it up for discusson.
>>>>
>>>> I agree with you that we should not deprecate the split/select methods.
>>>> Their semantics are very clear and they are widely adopted by Flink
>>>>
>>>> users.
>>>>
>>>> We should fix these problems instead of simply deprecating the methods.
>>>>
>>>> Regards,
>>>> Xiaogang
>>>>
>>>> Xingcan Cui <[hidden email]> <[hidden email]> 于2019年6月15日周六
>> 下午4:13写道:
>>>>
>>>>
>>>> Hi all,
>>>>
>>>> Recently, I noticed that the split/select methods in DataStream API
>>>>
>>>> have
>>>>
>>>> been marked as deprecated since 1.7.2 and 1.8.0 (the related JIRA
>>>> issue
>>>> FLINK-11084 <https://issues.apache.org/jira/browse/FLINK-11084> <
>> https://issues.apache.org/jira/browse/FLINK-11084>).
>>>>
>>>> Although the two methods can be replaced by the more powerful side
>>>>
>>>> output
>>>>
>>>> feature[1], I still doubt whether we should really remove them in
>>>> the future.
>>>>
>>>> 1. From semantics, the split/select is the reverse operation to the
>>>>
>>>> union
>>>>
>>>> transformation. Without them, the DataStream API seems to be
>>>> missing a piece.
>>>>
>>>> 2. From accessibility, the side output only works for process
>>>>
>>>> functions,
>>>>
>>>> which means it forces the user to dive into a lower API.
>>>>
>>>> According to FLINK-11084 <
>>>> https://issues.apache.org/jira/browse/FLINK-11084> <
>> https://issues.apache.org/jira/browse/FLINK-11084>, there exist some
>>>> problems with the current implementation of the two methods. Maybe
>>>> we should fix the problems and re-active them again. Or if they
>>>> really
>>>>
>>>> need
>>>>
>>>> to
>>>>
>>>> be deprecated, we should at least mark the corresponding
>>>> documentation
>>>>
>>>> for
>>>>
>>>> that : )
>>>>
>>>> What do you think?
>>>>
>>>> Best,
>>>> Xingcan
>>>>
>>>> [1]
>>>>
>>>>
>>>>
>> https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/side
>> _output.html
>>>>
>>>> <
>>>>
>>>>
>>>>
>> https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/side
>> _output.html
>>>>
>>>>
>>
>>
>

Reply | Threaded
Open this post in threaded view
|

Re: About Deprecating split/select for DataStream API

dwysakowicz
Hi all,

I think we are getting closer to a consensus. I think most of us already
agree that the current behavior is broken. The remaining difference I
see is that I think those problems are caused by the design of the
split/select method. The current contract of the split method is that it
is actually applied to the previous operation, rather than it creates a
new operator. (I don't think this is an implementation detail. This is
the contract of the API. It was described in docs, conference talks,
workshops etc.).

I agree that we could reimplement the split with side outputs in a way
that it would add additional operator in a chain and emit results via
side outputs. This would change the core concepts of the split method
though, making the "fixed" split method a completely new one with a new
behavior. Therefore I am in favor of dropping the old method and
introducing a new one. This would be an explicit information for the
users that this is something different, so that they can make a
conscious choice. Moreover if we reimplement the split method in a way
that it introduces additional operator and emits results via side
outputs, this is basically equivalent to enabling side outputs in a
flatMap method. You can think of the split method as a flatMap method. I
see no benefit of having additional split method that would have a
limited functionality of such flatMap with side outputs.

So to sum up my stance:

1. I am against changing the behavior of current split/select methods.

2. I would be ok with introducing a similar, but new methods, with a new
behavior (but would prefer not to do that, we could achieve the same or
even more with existing APIs)

3. I would be in favor of enabling side outputs for flatMap method (if
it is possible)

Best,

Dawid

On 18/06/2019 03:45, Dian Fu wrote:

> Hi all,
>
> Thanks a lot for the discussion. I'm also in favor of rewriting/redesigning the split/select API instead of removing them. It has been a consensus that the side output API can achieve all the functionalities of the split/select API. The problem is whether we should also support some easy-to-use APIs on top of it. IMO, we should do that as long as the APIs have clear semantic and wide usage scenario. I think split/select API is such a kind of API.
>
> Regards,
> Dian
>
>> 在 2019年6月18日,上午12:30,[hidden email] 写道:
>>
>> Hi all,
>>
>> Thanks for sharing your thoughts on this topic.
>>
>> First, we must admit that the current implementation for split/select is flawed. I roughly went through the source codes, the problem may be that for consecutive select/split(s), the former one will be overridden by the later one during StreamGraph generation phase. That's why we forbid this consecutive logic in FLINK-11084.
>>
>> Now the question is whether we should guide users to migrate to the new side output feature or thoroughly rework the broken API with the correct semantics (instead of just trying to forbid all the "invalid" usages).
>>
>> Personally, I prefer the later solution because
>>
>> 1. The split/select may have been widely used without touching the broken part.
>> 2. Though restricted compared with side output, the semantics for split/select itself is acceptable since union does not support different data types either.
>> 3. We need a complete and easy-to-use transformation set for DataStream API. Enabling side output for flatMap may not be an ultimate solution.
>>
>> To summarize, maybe we should not easily deprecate the split/select public API. If we come to a consensus on that, how about rewriting it based on side output? (like the implementation for join on coGroup)
>>
>> Any feedback is welcome : )
>>
>> Best,
>> Xingcan
>>
>> -----Original Message-----
>> From: SHI Xiaogang <[hidden email]>
>> Sent: Monday, June 17, 2019 8:08 AM
>> To: Dawid Wysakowicz <[hidden email]>
>> Cc: [hidden email]
>> Subject: Re: About Deprecating split/select for DataStream API
>>
>> Hi Dawid,
>>
>> Thanks a lot for your example.
>>
>> I think most users will expect splitted1 to be empty in the example.
>>
>> The unexpected results produced, in my opinion, is due to our problematic implementation, instead of the confusing semantics.
>> We can fix the problem if we add a SELECT operator to filter out unexpected records (Of course, we can find some optimization to improve the efficiency.).
>>
>> After all, i prefer to fix the problems to make the results as expected.
>> What do you think?
>>
>> Regards,
>> Xiaogang
>>
>> Dawid Wysakowicz <[hidden email]> 于2019年6月17日周一 下午7:21写道:
>>
>>> Yes you are correct. The problem I described applies to the split not
>>> select as I wrote in the first email. Sorry for that.
>>>
>>> I will try to prepare a correct example. Let's have a look at this example:
>>>
>>>    val splitted1 = ds.split(if (1) then "a")
>>>
>>>    val splitted2 = ds.split(if (!=1) then "a")
>>>
>>> In those cases splitted1.select("a") -> will output all elements, the
>>> same for splitted2, because the OutputSelector(s) are applied to
>>> previous operator. The behavior I would assume is that splitted1
>>> outputs only "1"s, whereas splitted2 all but "1"s
>>>
>>> On the other hand in a call
>>>
>>>    val splitted1 = ds.split(if ("1" or "2") then
>>> "a").select("a").split(if ("3") then "b").select("b")
>>>
>>> I would assume an intersection of those two splits, so no results.
>>> What actually happens is that it will be "1", "2" & "3"s. Actually,
>>> right exceptions should be thrown in those cases not to produce
>>> confusing results, but this just shows that this API is broken, if we
>>> need to check for some prohibited configurations during runtime.
>>>
>>> Those weird behaviors are in my opinion results of the flawed API, as
>>> it actually assigns an output selector to the previous operator. In
>>> other words it modifies previous operator. I think it would be much
>>> cleaner if this happened inside an operator rather than separately.
>>> This is what SideOutputs do, as you define them inside the
>>> ProcessFunction, rather than afterwards. Therefore I am very much in
>>> favor of using them for those cases. Once again if the problem is that
>>> they are available only in the ProcessFunction I would prefer enabling
>>> them e.g. in FlatMap, rather than keeping the split/select.
>>>
>>>
>>>
>>> On 17/06/2019 09:40, SHI Xiaogang wrote:
>>>> Hi Dawid,
>>>>
>>>> As the select method is only allowed on SplitStreams, it's
>>>> impossible to construct the example ds.split().select("a", "b").select("c", "d").
>>>>
>>>> Are you meaning ds.split().select("a", "b").split().select("c", "d")?
>>>> If so, then the tagging in the first split operation should not
>>>> affect
>>> the
>>>> second one. Then
>>>>    splitted.select("a", "b") => empty
>>>>    splitted.select("c", "d") => ds
>>>>
>>>> I cannot quite catch your point here. It's appreciated if you can
>>> provide a
>>>> more concrete explanation?
>>>>
>>>> Regards,
>>>> Xiaogang Shi
>>>>
>>>>
>>>>
>>>>
>>>> Dawid Wysakowicz <[hidden email]> 于2019年6月17日周一 下午3:10写道:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Thank you for starting the discussion. To start with I have to say
>>>>> I am not entirely against leaving them. On the other hand I totally
>>>>> disagree that the semantics are clearly defined. Actually the
>>>>> design is fundamentally flawed.
>>>>>
>>>>>   1. We use String as a selector for elements. This is not the cleanest
>>>>>   design, but I agree it is not the worst.
>>>>>   2. Users cannot define different types for different splits.
>>>>>   3. (The actual reason why I think it's actually better to drop the
>>>>>   split/select and introduce a better mechanism) The behavior of a
>>> split is
>>>>>   to actually add an output selector. We can have just a single
>>> selector on a
>>>>>   single operator, but the API allows (I would even say
>>>>> encourages) to
>>> create
>>>>>   chains of split/select, which leads to undefined behavior. Take
>>>>> this
>>> for
>>>>>   example: ds.split().select("a", "b").select("c", "d"). Which
>>>>> tags
>>> should be
>>>>>   forwarded? ("a", "b", "c", "d") (union) or () (intersection). In
>>>>> my
>>> opinion
>>>>>   the most obvious answer in this case would be the intersection. Let's
>>>>>   modify it slightly though and I would assume a different
>>>>> behavior
>>> (the
>>>>>   union)
>>>>>
>>>>>           splitted = ds.split();
>>>>>
>>>>>           splitted.select("a", "b").map()
>>>>>
>>>>>           splitted.select("c", "d").map()
>>>>>
>>>>> Taking the 3rd argument into consideration I would be in favor of
>>> removing
>>>>> the current mechanism. I think the side outputs serve the purpose
>>>>> much better with much cleaner semantics. I get the argument that
>>>>> users are
>>> now
>>>>> forced to use processFunction if they want to use the side outputs.
>>>>> If
>>> this
>>>>> is the main problem how about enabling them e.g. for flatMap as well?
>>>>>
>>>>> Best,
>>>>>
>>>>> Dawid
>>>>> On 17/06/2019 08:51, Jark Wu wrote:
>>>>>
>>>>> +1 to keep the split/select API. I think if there are some problems
>>>>> +with
>>>>> the API, it's better to fix them instead of deprecating them.
>>>>> And select/split are straightforward and convenient APIs. It's
>>>>> worth to have them.
>>>>>
>>>>> Regards,
>>>>> Jark
>>>>>
>>>>> On Mon, 17 Jun 2019 at 14:46, vino yang <[hidden email]> <
>>> [hidden email]> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I also think it is valuable and reasonable to keep the split/select
>>> APIs.
>>>>> They are very convenient and widely used in our platform. I think
>>>>> they
>>> are
>>>>> also used in other users' jobs.
>>>>> If the community has doubts about this, IMHO, it would be better to
>>> start a
>>>>> user survey.
>>>>>
>>>>> Best,
>>>>> Vino
>>>>>
>>>>> SHI Xiaogang <[hidden email]> <[hidden email]>
>>> 于2019年6月17日周一 上午11:55写道:
>>>>>
>>>>> Hi Xingcan,
>>>>>
>>>>> Thanks for bringing it up for discusson.
>>>>>
>>>>> I agree with you that we should not deprecate the split/select methods.
>>>>> Their semantics are very clear and they are widely adopted by Flink
>>>>>
>>>>> users.
>>>>>
>>>>> We should fix these problems instead of simply deprecating the methods.
>>>>>
>>>>> Regards,
>>>>> Xiaogang
>>>>>
>>>>> Xingcan Cui <[hidden email]> <[hidden email]> 于2019年6月15日周六
>>> 下午4:13写道:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> Recently, I noticed that the split/select methods in DataStream API
>>>>>
>>>>> have
>>>>>
>>>>> been marked as deprecated since 1.7.2 and 1.8.0 (the related JIRA
>>>>> issue
>>>>> FLINK-11084 <https://issues.apache.org/jira/browse/FLINK-11084> <
>>> https://issues.apache.org/jira/browse/FLINK-11084>).
>>>>> Although the two methods can be replaced by the more powerful side
>>>>>
>>>>> output
>>>>>
>>>>> feature[1], I still doubt whether we should really remove them in
>>>>> the future.
>>>>>
>>>>> 1. From semantics, the split/select is the reverse operation to the
>>>>>
>>>>> union
>>>>>
>>>>> transformation. Without them, the DataStream API seems to be
>>>>> missing a piece.
>>>>>
>>>>> 2. From accessibility, the side output only works for process
>>>>>
>>>>> functions,
>>>>>
>>>>> which means it forces the user to dive into a lower API.
>>>>>
>>>>> According to FLINK-11084 <
>>>>> https://issues.apache.org/jira/browse/FLINK-11084> <
>>> https://issues.apache.org/jira/browse/FLINK-11084>, there exist some
>>>>> problems with the current implementation of the two methods. Maybe
>>>>> we should fix the problems and re-active them again. Or if they
>>>>> really
>>>>>
>>>>> need
>>>>>
>>>>> to
>>>>>
>>>>> be deprecated, we should at least mark the corresponding
>>>>> documentation
>>>>>
>>>>> for
>>>>>
>>>>> that : )
>>>>>
>>>>> What do you think?
>>>>>
>>>>> Best,
>>>>> Xingcan
>>>>>
>>>>> [1]
>>>>>
>>>>>
>>>>>
>>> https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/side
>>> _output.html
>>>>> <
>>>>>
>>>>>
>>>>>
>>> https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/side
>>> _output.html
>>>>>
>>>


signature.asc (849 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: About Deprecating split/select for DataStream API

xingcanc
Hi all,

@Dawid Thanks for your further explanation.

I also think now we can safely move one step forward to discuss how to fix this problem, i.e., whether to  (explicitly) replace the existing split/select with a new API or (implicitly) re-implement it.

Note that the two options are actually not contradictory because we can simultaneously expose the side output feature in DataStream API (e.g., in flatMap as you suggested) and change the behavior of the existing split/select methods. Suppose we can accept the former one since it sounds nothing bad, the real question is whether we should still keep a "fixed" split/select.

IMO, the answer is based on how we can reduce the side effect as much as possible. I’d like to list the pros and cons and suggest starting a vote to make the choice in a new thread.

Keep a "fixed" split/select:

+ A more complete DataStream API;
+ Transparent to users who only rely on the normal behavior of split/select;

- Users relying on the “abnormal” behavior may suffer from this;
- Extra work on implementation and maintenance;
- The contract of the API will not be correct anymore;

Discard the split/select:

+ A more concise DataStream API;
+ A more safely migration phase for users;

- Users are forced to update their codes;
- The contract of the API is still correct but is not valuable anymore;

Welcome to give your supplement on the reasons and after collecting all the ideas, I'll start a voting thread.

Best,
Xingcan

> On Jun 18, 2019, at 2:18 AM, Dawid Wysakowicz <[hidden email]> wrote:
>
> Hi all,
>
> I think we are getting closer to a consensus. I think most of us already
> agree that the current behavior is broken. The remaining difference I
> see is that I think those problems are caused by the design of the
> split/select method. The current contract of the split method is that it
> is actually applied to the previous operation, rather than it creates a
> new operator. (I don't think this is an implementation detail. This is
> the contract of the API. It was described in docs, conference talks,
> workshops etc.).
>
> I agree that we could reimplement the split with side outputs in a way
> that it would add additional operator in a chain and emit results via
> side outputs. This would change the core concepts of the split method
> though, making the "fixed" split method a completely new one with a new
> behavior. Therefore I am in favor of dropping the old method and
> introducing a new one. This would be an explicit information for the
> users that this is something different, so that they can make a
> conscious choice. Moreover if we reimplement the split method in a way
> that it introduces additional operator and emits results via side
> outputs, this is basically equivalent to enabling side outputs in a
> flatMap method. You can think of the split method as a flatMap method. I
> see no benefit of having additional split method that would have a
> limited functionality of such flatMap with side outputs.
>
> So to sum up my stance:
>
> 1. I am against changing the behavior of current split/select methods.
>
> 2. I would be ok with introducing a similar, but new methods, with a new
> behavior (but would prefer not to do that, we could achieve the same or
> even more with existing APIs)
>
> 3. I would be in favor of enabling side outputs for flatMap method (if
> it is possible)
>
> Best,
>
> Dawid
>
> On 18/06/2019 03:45, Dian Fu wrote:
>> Hi all,
>>
>> Thanks a lot for the discussion. I'm also in favor of rewriting/redesigning the split/select API instead of removing them. It has been a consensus that the side output API can achieve all the functionalities of the split/select API. The problem is whether we should also support some easy-to-use APIs on top of it. IMO, we should do that as long as the APIs have clear semantic and wide usage scenario. I think split/select API is such a kind of API.
>>
>> Regards,
>> Dian
>>
>>> 在 2019年6月18日,上午12:30,[hidden email] 写道:
>>>
>>> Hi all,
>>>
>>> Thanks for sharing your thoughts on this topic.
>>>
>>> First, we must admit that the current implementation for split/select is flawed. I roughly went through the source codes, the problem may be that for consecutive select/split(s), the former one will be overridden by the later one during StreamGraph generation phase. That's why we forbid this consecutive logic in FLINK-11084.
>>>
>>> Now the question is whether we should guide users to migrate to the new side output feature or thoroughly rework the broken API with the correct semantics (instead of just trying to forbid all the "invalid" usages).
>>>
>>> Personally, I prefer the later solution because
>>>
>>> 1. The split/select may have been widely used without touching the broken part.
>>> 2. Though restricted compared with side output, the semantics for split/select itself is acceptable since union does not support different data types either.
>>> 3. We need a complete and easy-to-use transformation set for DataStream API. Enabling side output for flatMap may not be an ultimate solution.
>>>
>>> To summarize, maybe we should not easily deprecate the split/select public API. If we come to a consensus on that, how about rewriting it based on side output? (like the implementation for join on coGroup)
>>>
>>> Any feedback is welcome : )
>>>
>>> Best,
>>> Xingcan
>>>
>>> -----Original Message-----
>>> From: SHI Xiaogang <[hidden email]>
>>> Sent: Monday, June 17, 2019 8:08 AM
>>> To: Dawid Wysakowicz <[hidden email]>
>>> Cc: [hidden email]
>>> Subject: Re: About Deprecating split/select for DataStream API
>>>
>>> Hi Dawid,
>>>
>>> Thanks a lot for your example.
>>>
>>> I think most users will expect splitted1 to be empty in the example.
>>>
>>> The unexpected results produced, in my opinion, is due to our problematic implementation, instead of the confusing semantics.
>>> We can fix the problem if we add a SELECT operator to filter out unexpected records (Of course, we can find some optimization to improve the efficiency.).
>>>
>>> After all, i prefer to fix the problems to make the results as expected.
>>> What do you think?
>>>
>>> Regards,
>>> Xiaogang
>>>
>>> Dawid Wysakowicz <[hidden email]> 于2019年6月17日周一 下午7:21写道:
>>>
>>>> Yes you are correct. The problem I described applies to the split not
>>>> select as I wrote in the first email. Sorry for that.
>>>>
>>>> I will try to prepare a correct example. Let's have a look at this example:
>>>>
>>>>   val splitted1 = ds.split(if (1) then "a")
>>>>
>>>>   val splitted2 = ds.split(if (!=1) then "a")
>>>>
>>>> In those cases splitted1.select("a") -> will output all elements, the
>>>> same for splitted2, because the OutputSelector(s) are applied to
>>>> previous operator. The behavior I would assume is that splitted1
>>>> outputs only "1"s, whereas splitted2 all but "1"s
>>>>
>>>> On the other hand in a call
>>>>
>>>>   val splitted1 = ds.split(if ("1" or "2") then
>>>> "a").select("a").split(if ("3") then "b").select("b")
>>>>
>>>> I would assume an intersection of those two splits, so no results.
>>>> What actually happens is that it will be "1", "2" & "3"s. Actually,
>>>> right exceptions should be thrown in those cases not to produce
>>>> confusing results, but this just shows that this API is broken, if we
>>>> need to check for some prohibited configurations during runtime.
>>>>
>>>> Those weird behaviors are in my opinion results of the flawed API, as
>>>> it actually assigns an output selector to the previous operator. In
>>>> other words it modifies previous operator. I think it would be much
>>>> cleaner if this happened inside an operator rather than separately.
>>>> This is what SideOutputs do, as you define them inside the
>>>> ProcessFunction, rather than afterwards. Therefore I am very much in
>>>> favor of using them for those cases. Once again if the problem is that
>>>> they are available only in the ProcessFunction I would prefer enabling
>>>> them e.g. in FlatMap, rather than keeping the split/select.
>>>>
>>>>
>>>>
>>>> On 17/06/2019 09:40, SHI Xiaogang wrote:
>>>>> Hi Dawid,
>>>>>
>>>>> As the select method is only allowed on SplitStreams, it's
>>>>> impossible to construct the example ds.split().select("a", "b").select("c", "d").
>>>>>
>>>>> Are you meaning ds.split().select("a", "b").split().select("c", "d")?
>>>>> If so, then the tagging in the first split operation should not
>>>>> affect
>>>> the
>>>>> second one. Then
>>>>>   splitted.select("a", "b") => empty
>>>>>   splitted.select("c", "d") => ds
>>>>>
>>>>> I cannot quite catch your point here. It's appreciated if you can
>>>> provide a
>>>>> more concrete explanation?
>>>>>
>>>>> Regards,
>>>>> Xiaogang Shi
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Dawid Wysakowicz <[hidden email]> 于2019年6月17日周一 下午3:10写道:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Thank you for starting the discussion. To start with I have to say
>>>>>> I am not entirely against leaving them. On the other hand I totally
>>>>>> disagree that the semantics are clearly defined. Actually the
>>>>>> design is fundamentally flawed.
>>>>>>
>>>>>>  1. We use String as a selector for elements. This is not the cleanest
>>>>>>  design, but I agree it is not the worst.
>>>>>>  2. Users cannot define different types for different splits.
>>>>>>  3. (The actual reason why I think it's actually better to drop the
>>>>>>  split/select and introduce a better mechanism) The behavior of a
>>>> split is
>>>>>>  to actually add an output selector. We can have just a single
>>>> selector on a
>>>>>>  single operator, but the API allows (I would even say
>>>>>> encourages) to
>>>> create
>>>>>>  chains of split/select, which leads to undefined behavior. Take
>>>>>> this
>>>> for
>>>>>>  example: ds.split().select("a", "b").select("c", "d"). Which
>>>>>> tags
>>>> should be
>>>>>>  forwarded? ("a", "b", "c", "d") (union) or () (intersection). In
>>>>>> my
>>>> opinion
>>>>>>  the most obvious answer in this case would be the intersection. Let's
>>>>>>  modify it slightly though and I would assume a different
>>>>>> behavior
>>>> (the
>>>>>>  union)
>>>>>>
>>>>>>          splitted = ds.split();
>>>>>>
>>>>>>          splitted.select("a", "b").map()
>>>>>>
>>>>>>          splitted.select("c", "d").map()
>>>>>>
>>>>>> Taking the 3rd argument into consideration I would be in favor of
>>>> removing
>>>>>> the current mechanism. I think the side outputs serve the purpose
>>>>>> much better with much cleaner semantics. I get the argument that
>>>>>> users are
>>>> now
>>>>>> forced to use processFunction if they want to use the side outputs.
>>>>>> If
>>>> this
>>>>>> is the main problem how about enabling them e.g. for flatMap as well?
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Dawid
>>>>>> On 17/06/2019 08:51, Jark Wu wrote:
>>>>>>
>>>>>> +1 to keep the split/select API. I think if there are some problems
>>>>>> +with
>>>>>> the API, it's better to fix them instead of deprecating them.
>>>>>> And select/split are straightforward and convenient APIs. It's
>>>>>> worth to have them.
>>>>>>
>>>>>> Regards,
>>>>>> Jark
>>>>>>
>>>>>> On Mon, 17 Jun 2019 at 14:46, vino yang <[hidden email]> <
>>>> [hidden email]> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I also think it is valuable and reasonable to keep the split/select
>>>> APIs.
>>>>>> They are very convenient and widely used in our platform. I think
>>>>>> they
>>>> are
>>>>>> also used in other users' jobs.
>>>>>> If the community has doubts about this, IMHO, it would be better to
>>>> start a
>>>>>> user survey.
>>>>>>
>>>>>> Best,
>>>>>> Vino
>>>>>>
>>>>>> SHI Xiaogang <[hidden email]> <[hidden email]>
>>>> 于2019年6月17日周一 上午11:55写道:
>>>>>>
>>>>>> Hi Xingcan,
>>>>>>
>>>>>> Thanks for bringing it up for discusson.
>>>>>>
>>>>>> I agree with you that we should not deprecate the split/select methods.
>>>>>> Their semantics are very clear and they are widely adopted by Flink
>>>>>>
>>>>>> users.
>>>>>>
>>>>>> We should fix these problems instead of simply deprecating the methods.
>>>>>>
>>>>>> Regards,
>>>>>> Xiaogang
>>>>>>
>>>>>> Xingcan Cui <[hidden email]> <[hidden email]> 于2019年6月15日周六
>>>> 下午4:13写道:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Recently, I noticed that the split/select methods in DataStream API
>>>>>>
>>>>>> have
>>>>>>
>>>>>> been marked as deprecated since 1.7.2 and 1.8.0 (the related JIRA
>>>>>> issue
>>>>>> FLINK-11084 <https://issues.apache.org/jira/browse/FLINK-11084> <
>>>> https://issues.apache.org/jira/browse/FLINK-11084>).
>>>>>> Although the two methods can be replaced by the more powerful side
>>>>>>
>>>>>> output
>>>>>>
>>>>>> feature[1], I still doubt whether we should really remove them in
>>>>>> the future.
>>>>>>
>>>>>> 1. From semantics, the split/select is the reverse operation to the
>>>>>>
>>>>>> union
>>>>>>
>>>>>> transformation. Without them, the DataStream API seems to be
>>>>>> missing a piece.
>>>>>>
>>>>>> 2. From accessibility, the side output only works for process
>>>>>>
>>>>>> functions,
>>>>>>
>>>>>> which means it forces the user to dive into a lower API.
>>>>>>
>>>>>> According to FLINK-11084 <
>>>>>> https://issues.apache.org/jira/browse/FLINK-11084> <
>>>> https://issues.apache.org/jira/browse/FLINK-11084>, there exist some
>>>>>> problems with the current implementation of the two methods. Maybe
>>>>>> we should fix the problems and re-active them again. Or if they
>>>>>> really
>>>>>>
>>>>>> need
>>>>>>
>>>>>> to
>>>>>>
>>>>>> be deprecated, we should at least mark the corresponding
>>>>>> documentation
>>>>>>
>>>>>> for
>>>>>>
>>>>>> that : )
>>>>>>
>>>>>> What do you think?
>>>>>>
>>>>>> Best,
>>>>>> Xingcan
>>>>>>
>>>>>> [1]
>>>>>>
>>>>>>
>>>>>>
>>>> https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/side
>>>> _output.html
>>>>>> <
>>>>>>
>>>>>>
>>>>>>
>>>> https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/side
>>>> _output.html
>>>>>>
>>>>
>

Reply | Threaded
Open this post in threaded view
|

Re: About Deprecating split/select for DataStream API

SHI Xiaogang
Hi all,

I think many Flink users are not aware of the "abnormal" behavior of
current split/select implementation.
As split/select are widely used, maybe we should notice our users of the
abnormal behavior in the user mailing list and hear their suggestions on
how to fix it.
Then we can make decision with their feedbacks and start the vote.

Regards,
Xiaogang


Xingcan Cui <[hidden email]> 于2019年6月19日周三 上午5:14写道:

> Hi all,
>
> @Dawid Thanks for your further explanation.
>
> I also think now we can safely move one step forward to discuss how to fix
> this problem, i.e., whether to  (explicitly) replace the existing
> split/select with a new API or (implicitly) re-implement it.
>
> Note that the two options are actually not contradictory because we can
> simultaneously expose the side output feature in DataStream API (e.g., in
> flatMap as you suggested) and change the behavior of the existing
> split/select methods. Suppose we can accept the former one since it sounds
> nothing bad, the real question is whether we should still keep a "fixed"
> split/select.
>
> IMO, the answer is based on how we can reduce the side effect as much as
> possible. I’d like to list the pros and cons and suggest starting a vote to
> make the choice in a new thread.
>
> Keep a "fixed" split/select:
>
> + A more complete DataStream API;
> + Transparent to users who only rely on the normal behavior of
> split/select;
>
> - Users relying on the “abnormal” behavior may suffer from this;
> - Extra work on implementation and maintenance;
> - The contract of the API will not be correct anymore;
>
> Discard the split/select:
>
> + A more concise DataStream API;
> + A more safely migration phase for users;
>
> - Users are forced to update their codes;
> - The contract of the API is still correct but is not valuable anymore;
>
> Welcome to give your supplement on the reasons and after collecting all
> the ideas, I'll start a voting thread.
>
> Best,
> Xingcan
>
> > On Jun 18, 2019, at 2:18 AM, Dawid Wysakowicz <[hidden email]>
> wrote:
> >
> > Hi all,
> >
> > I think we are getting closer to a consensus. I think most of us already
> > agree that the current behavior is broken. The remaining difference I
> > see is that I think those problems are caused by the design of the
> > split/select method. The current contract of the split method is that it
> > is actually applied to the previous operation, rather than it creates a
> > new operator. (I don't think this is an implementation detail. This is
> > the contract of the API. It was described in docs, conference talks,
> > workshops etc.).
> >
> > I agree that we could reimplement the split with side outputs in a way
> > that it would add additional operator in a chain and emit results via
> > side outputs. This would change the core concepts of the split method
> > though, making the "fixed" split method a completely new one with a new
> > behavior. Therefore I am in favor of dropping the old method and
> > introducing a new one. This would be an explicit information for the
> > users that this is something different, so that they can make a
> > conscious choice. Moreover if we reimplement the split method in a way
> > that it introduces additional operator and emits results via side
> > outputs, this is basically equivalent to enabling side outputs in a
> > flatMap method. You can think of the split method as a flatMap method. I
> > see no benefit of having additional split method that would have a
> > limited functionality of such flatMap with side outputs.
> >
> > So to sum up my stance:
> >
> > 1. I am against changing the behavior of current split/select methods.
> >
> > 2. I would be ok with introducing a similar, but new methods, with a new
> > behavior (but would prefer not to do that, we could achieve the same or
> > even more with existing APIs)
> >
> > 3. I would be in favor of enabling side outputs for flatMap method (if
> > it is possible)
> >
> > Best,
> >
> > Dawid
> >
> > On 18/06/2019 03:45, Dian Fu wrote:
> >> Hi all,
> >>
> >> Thanks a lot for the discussion. I'm also in favor of
> rewriting/redesigning the split/select API instead of removing them. It has
> been a consensus that the side output API can achieve all the
> functionalities of the split/select API. The problem is whether we should
> also support some easy-to-use APIs on top of it. IMO, we should do that as
> long as the APIs have clear semantic and wide usage scenario. I think
> split/select API is such a kind of API.
> >>
> >> Regards,
> >> Dian
> >>
> >>> 在 2019年6月18日,上午12:30,[hidden email] 写道:
> >>>
> >>> Hi all,
> >>>
> >>> Thanks for sharing your thoughts on this topic.
> >>>
> >>> First, we must admit that the current implementation for split/select
> is flawed. I roughly went through the source codes, the problem may be that
> for consecutive select/split(s), the former one will be overridden by the
> later one during StreamGraph generation phase. That's why we forbid this
> consecutive logic in FLINK-11084.
> >>>
> >>> Now the question is whether we should guide users to migrate to the
> new side output feature or thoroughly rework the broken API with the
> correct semantics (instead of just trying to forbid all the "invalid"
> usages).
> >>>
> >>> Personally, I prefer the later solution because
> >>>
> >>> 1. The split/select may have been widely used without touching the
> broken part.
> >>> 2. Though restricted compared with side output, the semantics for
> split/select itself is acceptable since union does not support different
> data types either.
> >>> 3. We need a complete and easy-to-use transformation set for
> DataStream API. Enabling side output for flatMap may not be an ultimate
> solution.
> >>>
> >>> To summarize, maybe we should not easily deprecate the split/select
> public API. If we come to a consensus on that, how about rewriting it based
> on side output? (like the implementation for join on coGroup)
> >>>
> >>> Any feedback is welcome : )
> >>>
> >>> Best,
> >>> Xingcan
> >>>
> >>> -----Original Message-----
> >>> From: SHI Xiaogang <[hidden email]>
> >>> Sent: Monday, June 17, 2019 8:08 AM
> >>> To: Dawid Wysakowicz <[hidden email]>
> >>> Cc: [hidden email]
> >>> Subject: Re: About Deprecating split/select for DataStream API
> >>>
> >>> Hi Dawid,
> >>>
> >>> Thanks a lot for your example.
> >>>
> >>> I think most users will expect splitted1 to be empty in the example.
> >>>
> >>> The unexpected results produced, in my opinion, is due to our
> problematic implementation, instead of the confusing semantics.
> >>> We can fix the problem if we add a SELECT operator to filter out
> unexpected records (Of course, we can find some optimization to improve the
> efficiency.).
> >>>
> >>> After all, i prefer to fix the problems to make the results as
> expected.
> >>> What do you think?
> >>>
> >>> Regards,
> >>> Xiaogang
> >>>
> >>> Dawid Wysakowicz <[hidden email]> 于2019年6月17日周一 下午7:21写道:
> >>>
> >>>> Yes you are correct. The problem I described applies to the split not
> >>>> select as I wrote in the first email. Sorry for that.
> >>>>
> >>>> I will try to prepare a correct example. Let's have a look at this
> example:
> >>>>
> >>>>   val splitted1 = ds.split(if (1) then "a")
> >>>>
> >>>>   val splitted2 = ds.split(if (!=1) then "a")
> >>>>
> >>>> In those cases splitted1.select("a") -> will output all elements, the
> >>>> same for splitted2, because the OutputSelector(s) are applied to
> >>>> previous operator. The behavior I would assume is that splitted1
> >>>> outputs only "1"s, whereas splitted2 all but "1"s
> >>>>
> >>>> On the other hand in a call
> >>>>
> >>>>   val splitted1 = ds.split(if ("1" or "2") then
> >>>> "a").select("a").split(if ("3") then "b").select("b")
> >>>>
> >>>> I would assume an intersection of those two splits, so no results.
> >>>> What actually happens is that it will be "1", "2" & "3"s. Actually,
> >>>> right exceptions should be thrown in those cases not to produce
> >>>> confusing results, but this just shows that this API is broken, if we
> >>>> need to check for some prohibited configurations during runtime.
> >>>>
> >>>> Those weird behaviors are in my opinion results of the flawed API, as
> >>>> it actually assigns an output selector to the previous operator. In
> >>>> other words it modifies previous operator. I think it would be much
> >>>> cleaner if this happened inside an operator rather than separately.
> >>>> This is what SideOutputs do, as you define them inside the
> >>>> ProcessFunction, rather than afterwards. Therefore I am very much in
> >>>> favor of using them for those cases. Once again if the problem is
> that
> >>>> they are available only in the ProcessFunction I would prefer
> enabling
> >>>> them e.g. in FlatMap, rather than keeping the split/select.
> >>>>
> >>>>
> >>>>
> >>>> On 17/06/2019 09:40, SHI Xiaogang wrote:
> >>>>> Hi Dawid,
> >>>>>
> >>>>> As the select method is only allowed on SplitStreams, it's
> >>>>> impossible to construct the example ds.split().select("a",
> "b").select("c", "d").
> >>>>>
> >>>>> Are you meaning ds.split().select("a", "b").split().select("c", "d")?
> >>>>> If so, then the tagging in the first split operation should not
> >>>>> affect
> >>>> the
> >>>>> second one. Then
> >>>>>   splitted.select("a", "b") => empty
> >>>>>   splitted.select("c", "d") => ds
> >>>>>
> >>>>> I cannot quite catch your point here. It's appreciated if you can
> >>>> provide a
> >>>>> more concrete explanation?
> >>>>>
> >>>>> Regards,
> >>>>> Xiaogang Shi
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> Dawid Wysakowicz <[hidden email]> 于2019年6月17日周一 下午3:10写道:
> >>>>>
> >>>>>> Hi all,
> >>>>>>
> >>>>>> Thank you for starting the discussion. To start with I have to say
> >>>>>> I am not entirely against leaving them. On the other hand I totally
> >>>>>> disagree that the semantics are clearly defined. Actually the
> >>>>>> design is fundamentally flawed.
> >>>>>>
> >>>>>>  1. We use String as a selector for elements. This is not the
> cleanest
> >>>>>>  design, but I agree it is not the worst.
> >>>>>>  2. Users cannot define different types for different splits.
> >>>>>>  3. (The actual reason why I think it's actually better to drop the
> >>>>>>  split/select and introduce a better mechanism) The behavior of a
> >>>> split is
> >>>>>>  to actually add an output selector. We can have just a single
> >>>> selector on a
> >>>>>>  single operator, but the API allows (I would even say
> >>>>>> encourages) to
> >>>> create
> >>>>>>  chains of split/select, which leads to undefined behavior. Take
> >>>>>> this
> >>>> for
> >>>>>>  example: ds.split().select("a", "b").select("c", "d"). Which
> >>>>>> tags
> >>>> should be
> >>>>>>  forwarded? ("a", "b", "c", "d") (union) or () (intersection). In
> >>>>>> my
> >>>> opinion
> >>>>>>  the most obvious answer in this case would be the intersection.
> Let's
> >>>>>>  modify it slightly though and I would assume a different
> >>>>>> behavior
> >>>> (the
> >>>>>>  union)
> >>>>>>
> >>>>>>          splitted = ds.split();
> >>>>>>
> >>>>>>          splitted.select("a", "b").map()
> >>>>>>
> >>>>>>          splitted.select("c", "d").map()
> >>>>>>
> >>>>>> Taking the 3rd argument into consideration I would be in favor of
> >>>> removing
> >>>>>> the current mechanism. I think the side outputs serve the purpose
> >>>>>> much better with much cleaner semantics. I get the argument that
> >>>>>> users are
> >>>> now
> >>>>>> forced to use processFunction if they want to use the side outputs.
> >>>>>> If
> >>>> this
> >>>>>> is the main problem how about enabling them e.g. for flatMap as
> well?
> >>>>>>
> >>>>>> Best,
> >>>>>>
> >>>>>> Dawid
> >>>>>> On 17/06/2019 08:51, Jark Wu wrote:
> >>>>>>
> >>>>>> +1 to keep the split/select API. I think if there are some problems
> >>>>>> +with
> >>>>>> the API, it's better to fix them instead of deprecating them.
> >>>>>> And select/split are straightforward and convenient APIs. It's
> >>>>>> worth to have them.
> >>>>>>
> >>>>>> Regards,
> >>>>>> Jark
> >>>>>>
> >>>>>> On Mon, 17 Jun 2019 at 14:46, vino yang <[hidden email]> <
> >>>> [hidden email]> wrote:
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> I also think it is valuable and reasonable to keep the split/select
> >>>> APIs.
> >>>>>> They are very convenient and widely used in our platform. I think
> >>>>>> they
> >>>> are
> >>>>>> also used in other users' jobs.
> >>>>>> If the community has doubts about this, IMHO, it would be better to
> >>>> start a
> >>>>>> user survey.
> >>>>>>
> >>>>>> Best,
> >>>>>> Vino
> >>>>>>
> >>>>>> SHI Xiaogang <[hidden email]> <[hidden email]>
> >>>> 于2019年6月17日周一 上午11:55写道:
> >>>>>>
> >>>>>> Hi Xingcan,
> >>>>>>
> >>>>>> Thanks for bringing it up for discusson.
> >>>>>>
> >>>>>> I agree with you that we should not deprecate the split/select
> methods.
> >>>>>> Their semantics are very clear and they are widely adopted by Flink
> >>>>>>
> >>>>>> users.
> >>>>>>
> >>>>>> We should fix these problems instead of simply deprecating the
> methods.
> >>>>>>
> >>>>>> Regards,
> >>>>>> Xiaogang
> >>>>>>
> >>>>>> Xingcan Cui <[hidden email]> <[hidden email]> 于2019年6月15日周六
> >>>> 下午4:13写道:
> >>>>>>
> >>>>>> Hi all,
> >>>>>>
> >>>>>> Recently, I noticed that the split/select methods in DataStream API
> >>>>>>
> >>>>>> have
> >>>>>>
> >>>>>> been marked as deprecated since 1.7.2 and 1.8.0 (the related JIRA
> >>>>>> issue
> >>>>>> FLINK-11084 <https://issues.apache.org/jira/browse/FLINK-11084> <
> >>>> https://issues.apache.org/jira/browse/FLINK-11084>).
> >>>>>> Although the two methods can be replaced by the more powerful side
> >>>>>>
> >>>>>> output
> >>>>>>
> >>>>>> feature[1], I still doubt whether we should really remove them in
> >>>>>> the future.
> >>>>>>
> >>>>>> 1. From semantics, the split/select is the reverse operation to the
> >>>>>>
> >>>>>> union
> >>>>>>
> >>>>>> transformation. Without them, the DataStream API seems to be
> >>>>>> missing a piece.
> >>>>>>
> >>>>>> 2. From accessibility, the side output only works for process
> >>>>>>
> >>>>>> functions,
> >>>>>>
> >>>>>> which means it forces the user to dive into a lower API.
> >>>>>>
> >>>>>> According to FLINK-11084 <
> >>>>>> https://issues.apache.org/jira/browse/FLINK-11084> <
> >>>> https://issues.apache.org/jira/browse/FLINK-11084>, there exist some
> >>>>>> problems with the current implementation of the two methods. Maybe
> >>>>>> we should fix the problems and re-active them again. Or if they
> >>>>>> really
> >>>>>>
> >>>>>> need
> >>>>>>
> >>>>>> to
> >>>>>>
> >>>>>> be deprecated, we should at least mark the corresponding
> >>>>>> documentation
> >>>>>>
> >>>>>> for
> >>>>>>
> >>>>>> that : )
> >>>>>>
> >>>>>> What do you think?
> >>>>>>
> >>>>>> Best,
> >>>>>> Xingcan
> >>>>>>
> >>>>>> [1]
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>
> https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/side
> >>>> _output.html
> >>>>>> <
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>
> https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/side
> >>>> _output.html
> >>>>>>
> >>>>
> >
>
>