Union a data stream with a product of itself

Vasiliki Kalavri
Hi squirrels,

when porting the Gelly streaming code from 0.9 to 0.10 today with Paris, we
hit an exception in union: "*A DataStream cannot be unioned with itself*".

The code raising this exception looks like this:
stream.union(stream.map(...)).

Taking a look at the union code, we see that it is no longer allowed to
union a stream with itself, or with any product of itself.

First, we are wondering, why is that? Does it make building the stream
graph easier in some way?
Second, we might want to give a better error message there, e.g. "*A
DataStream cannot be unioned with itself or a product of itself*", and
finally, we should update the docs, which currently state that unioning a
stream with itself is allowed and that "*If you union a data stream with
itself you will still only get each element once.*"

Cheers,
-Vasia.

Re: Union a data stream with a product of itself

Gyula Fóra
Yes, I am not sure if this is the intended behaviour. I think you are
supposed to be able to do the things you described.

stream.union(stream.map(..)) and things like this are fair operations. Also
maybe stream.union(stream) should just give stream instead of an error.

Could someone who knows the reasoning behind the current mechanics comment
on this?

Gyula

On Tue, Nov 24, 2015 at 16:46, Vasiliki Kalavri <[hidden email]> wrote:


Re: Union a data stream with a product of itself

Stephan Ewen
"stream.union(stream.map(..))" should definitely be possible. Not sure why
this is not permitted.

"stream.union(stream)" would contain each element twice, so should either
give an error or actually union (or duplicate) elements...

Stephan
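
To make the duplication point concrete, here is a minimal sketch of bag-style union semantics. It is a toy model only: plain Java lists stand in for streams, and the class BagUnionSketch and its helpers union and map are hypothetical names for illustration, not Flink's DataStream API. A real streaming union would also interleave elements rather than append one input after the other.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Toy model only: plain Java lists stand in for streams; this is not Flink's API.
final class BagUnionSketch {

    // Bag (multiset) union: every element of both inputs is forwarded unchanged.
    static <T> List<T> union(List<T> left, List<T> right) {
        return Stream.concat(left.stream(), right.stream()).collect(Collectors.toList());
    }

    // Hypothetical helper mirroring stream.map(f).
    static <T, R> List<R> map(List<T> input, Function<T, R> f) {
        return input.stream().map(f).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> stream = Arrays.asList(1, 2, 3);

        // Self-union under bag semantics: each element shows up twice.
        System.out.println(union(stream, stream));                   // [1, 2, 3, 1, 2, 3]

        // Union with a product of itself: just two different inputs.
        System.out.println(union(stream, map(stream, x -> x * 10))); // [1, 2, 3, 10, 20, 30]
    }
}
```

Under this reading, stream.union(stream.map(..)) is an ordinary union of two inputs, while stream.union(stream) has to either duplicate elements or be rejected.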


On Wed, Nov 25, 2015 at 10:42 AM, Gyula Fóra <[hidden email]> wrote:


Re: Union a data stream with a product of itself

Bruecke, Christoph
Hi,

the operation “stream.union(stream.map(id))” is equivalent to “stream.union(stream)”, isn’t it? So it might also duplicate the data.

- Christoph


On 25 Nov 2015, at 11:24, Stephan Ewen <[hidden email]> wrote:

Re: Union a data stream with a product of itself

Gyula Fóra
Well, it kind of depends on which definition of union we are using. If this
is a union in the set-theoretic sense, we can argue that the union of a stream
with itself should be the same stream, because it contains exactly the same
elements with the same timestamps and lineage.

On the other hand, stream and stream.map(id) are not exactly the same, as
their elements might arrive in a different order (the lineage differs).

So I wouldn't say that any one self-union semantics is the only possible one.

Gyula
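
The set-theoretic reading can be sketched as follows. This is a toy model under stated assumptions: lists stand in for streams, "lineage" is reduced to a plain string tag, and the names SelfUnionSemantics, Element, mapIdentity and setUnion are hypothetical, not part of Flink.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Objects;

// Toy model only: lists stand in for streams, and "lineage" is a plain string tag.
final class SelfUnionSemantics {

    // An element together with its timestamp and the path it took through the job.
    static final class Element {
        final int value;
        final long timestamp;
        final String lineage;

        Element(int value, long timestamp, String lineage) {
            this.value = value;
            this.timestamp = timestamp;
            this.lineage = lineage;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Element)) {
                return false;
            }
            Element other = (Element) o;
            return value == other.value
                    && timestamp == other.timestamp
                    && lineage.equals(other.lineage);
        }

        @Override
        public int hashCode() {
            return Objects.hash(value, timestamp, lineage);
        }
    }

    // Identity map: values and timestamps are unchanged, but the lineage grows by one hop.
    static List<Element> mapIdentity(List<Element> input) {
        List<Element> out = new ArrayList<>();
        for (Element e : input) {
            out.add(new Element(e.value, e.timestamp, e.lineage + "->map"));
        }
        return out;
    }

    // Set-style union: elements equal in value, timestamp and lineage collapse into one.
    static List<Element> setUnion(List<Element> left, List<Element> right) {
        LinkedHashSet<Element> merged = new LinkedHashSet<>(left);
        merged.addAll(right);
        return new ArrayList<>(merged);
    }

    public static void main(String[] args) {
        List<Element> stream = Arrays.asList(
                new Element(1, 100L, "source"),
                new Element(2, 200L, "source"));

        // Same elements, timestamps and lineage: the self-union is the stream itself.
        System.out.println(setUnion(stream, stream).size());              // 2

        // The mapped copy has a different lineage, so nothing collapses.
        System.out.println(setUnion(stream, mapIdentity(stream)).size()); // 4
    }
}
```

In this model the self-union collapses back to the original stream, while the union with stream.map(id) keeps both copies, which is why the two expressions need not be treated as equivalent.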

On Wed, Nov 25, 2015 at 13:47, Bruecke, Christoph <[hidden email]> wrote:


Re: Union a data stream with a product of itself

Vasiliki Kalavri
So, do we all agree that the current behavior is not correct? Shall I open
a JIRA about this?

On 25 November 2015 at 13:58, Gyula Fóra <[hidden email]> wrote:


Re: Union a data stream with a product of itself

Gyula Fóra
Yes, please

On Wed, Nov 25, 2015 at 14:37, Vasiliki Kalavri <[hidden email]> wrote:


Re: Union a data stream with a product of itself

Vasiliki Kalavri
Here's the issue: https://issues.apache.org/jira/browse/FLINK-3080

-V.

On 25 November 2015 at 14:38, Gyula Fóra <[hidden email]> wrote:
