(DEPRECATED) Apache Flink Mailing List archive.

Thoughts About Object Reuse and Collection Execution


Aljoscha Krettek-2

Thoughts About Object Reuse and Collection Execution

Hello Nation of Flink,
while figuring out this bug:
https://issues.apache.org/jira/browse/FLINK-1569
I came upon some difficulties. The problem is that the KeyExtractorMappers
always return the same tuple. This is problematic, since Collection Execution
simply stores the returned values in a list. These elements are
not copied before they are stored when object reuse is enabled.
Therefore, every entry in the list ends up pointing to that one reused element.
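
To make the failure mode described above concrete, here is a minimal sketch in plain Java. It does not use any Flink classes: `ReusingKeyExtractor` and `collectWithoutCopy` are hypothetical stand-ins for a key extractor mapper and for a collection-based executor, and an `int[]` pair stands in for the tuple.

```java
import java.util.ArrayList;
import java.util.List;

class ReuseAliasingSketch {

    // Hypothetical stand-in for a key extractor mapper that reuses one output
    // object. Slot 0 holds the extracted key, slot 1 holds the original value.
    static final class ReusingKeyExtractor {
        private final int[] reused = new int[2];

        int[] map(int value) {
            reused[0] = value % 2; // illustrative key: parity of the value
            reused[1] = value;
            return reused;         // same instance on every call
        }
    }

    // Hypothetical stand-in for a collection-based executor that stores the
    // mapper's results in a list without copying them.
    static List<int[]> collectWithoutCopy(int[] input) {
        ReusingKeyExtractor mapper = new ReusingKeyExtractor();
        List<int[]> out = new ArrayList<>();
        for (int value : input) {
            out.add(mapper.map(value)); // stores a reference, not a snapshot
        }
        return out;
    }

    public static void main(String[] args) {
        List<int[]> result = collectWithoutCopy(new int[] {1, 2, 3});
        // Every list entry is the same object, so all of them show the last value.
        for (int[] pair : result) {
            System.out.println(pair[0] + ", " + pair[1]);
        }
    }
}
```

The list has the right length, but every entry aliases the single reused pair, so only the last record written is visible.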

I see two options for solving this:
1. Change KeyExtractorMappers to always return a new tuple, thereby
making object-reuse mode in cluster execution useless for key
extractors.

2. Change collection execution mapper to always make copies of the
returned elements. This would make object-reuse in collection
execution pretty much obsolete, IMHO.
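
The two options can be compared side by side in a small plain-Java sketch. All names here (`extractFresh`, `reusingExtractor`, `run`, the `copyResults` flag) are assumptions made for illustration and are not Flink APIs; an `int[]` pair again stands in for the tuple.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

class ReuseFixesSketch {

    // Option 1: the extractor allocates a fresh pair on every call.
    static int[] extractFresh(int value) {
        return new int[] {value % 2, value};
    }

    // The original behaviour: one pair is overwritten and returned each time.
    static IntFunction<int[]> reusingExtractor() {
        int[] reused = new int[2];
        return value -> {
            reused[0] = value % 2;
            reused[1] = value;
            return reused;
        };
    }

    // Hypothetical collection-style executor. With copyResults == true it
    // implements option 2 and snapshots each result before storing it.
    static List<int[]> run(IntFunction<int[]> extractor, int[] input, boolean copyResults) {
        List<int[]> out = new ArrayList<>();
        for (int value : input) {
            int[] result = extractor.apply(value);
            out.add(copyResults ? result.clone() : result);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] input = {1, 2, 3};
        List<int[]> option1 = run(ReuseFixesSketch::extractFresh, input, false);
        List<int[]> option2 = run(reusingExtractor(), input, true);
        System.out.println(option1.get(0)[1] + " " + option1.get(2)[1]);
        System.out.println(option2.get(0)[1] + " " + option2.get(2)[1]);
    }
}
```

Both variants allocate one object per record. The difference is where the allocation happens: option 1 pays it in the extractor under every backend, while option 2 pays it only inside the collection executor.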

How should we proceed with this?

Cheers,
Aljoscha

Stephan Ewen

Re: Thoughts About Object Reuse and Collection Execution

I vote to have the key extractor return a new value each time. That means
that objects are not reused everywhere where it is possible, but still in
most places, which still helps.

What still puzzles me: I thought that the collection execution stores
copies of the returned records by default (reuse safe mode).

On 27.02.2015 at 15:36, "Aljoscha Krettek" <[hidden email]> wrote:

> Hello Nation of Flink,
> while figuring out this bug:
> https://issues.apache.org/jira/browse/FLINK-1569
> I came upon some difficulties. The problem is that the
> KeyExtractorMappers always
> return the same tuple. This is problematic, since Collection Execution
> does simply store the returned values in a list. These elements are
> not copied before they are stored when object reuse is enabled.
> Therefore, the whole list will contain only that one reused element.
>
> I see two options for solving this:
> 1. Change KeyExtractorMappers to always return a new tuple, thereby
> making object-reuse mode in cluster execution useless for key
> extractors.
>
> 2. Change collection execution mapper to always make copies of the
> returned elements. This would make object-reuse in collection
> execution pretty much obsolete, IMHO.
>
> How should we proceed with this?
>
> Cheers,
> Aljoscha
>

Ted Dunning

Re: Thoughts About Object Reuse and Collection Execution

This is going to have profound performance implications if this is the only
path for iteration.

On Fri, Feb 27, 2015 at 10:58 PM, Stephan Ewen <[hidden email]> wrote:

> I vote to have the key extractor return a new value each time. That means
> that objects are not reused everywhere where it is possible, but still in
> most places, which still helps.
>
> What still puzzles me: I thought that the collection execution stores
> copies of the returned records by default (reuse safe mode).

Aljoscha Krettek-2

Re: Thoughts About Object Reuse and Collection Execution

@Stephan, they are not copied when object reuse is enabled. This might
be a problem, though, so maybe we should just change it in that place.

On Sat, Feb 28, 2015 at 7:57 AM, Ted Dunning <[hidden email]> wrote:

> This is going to have profound performance implications if this is the only
> path for iteration.

Stephan Ewen

Re: Thoughts About Object Reuse and Collection Execution

@Ted Here is a bit of background about how things are currently done in
the Flink runtime:

There are two execution modes for the runtime: "reuse" and "non-reuse".
- The "non-reuse" mode will create new objects for every record received
from the network, or taken out of a sort-buffer or hash-table.
It has the advantage that if some user code group operation
(groupReduce) materializes the group, it simply works (dedicated objects
for each element).

- The "reuse" mode will have one or two objects that will be reused each
time a record is received from the network, or taken out of a sort-buffer
or hash-table.
It behaves similarly to object reuse in Hadoop and saves on garbage
collection, but requires the user code to sometimes be aware of object
reuse.
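
As a rough illustration of the two modes, the following plain-Java sketch simulates a runtime handing records to user code that materializes a group. `Record`, `RecordReader`, and `materialize` are hypothetical names chosen for this example and are not the actual runtime classes.

```java
import java.util.ArrayList;
import java.util.List;

class ReuseModesSketch {

    static final class Record {
        int key;
        int value;
    }

    // Hypothetical reader over already-received data; each row is {key, value}.
    static final class RecordReader {
        private final int[][] raw;
        private int pos = 0;

        RecordReader(int[][] raw) {
            this.raw = raw;
        }

        boolean hasNext() {
            return pos < raw.length;
        }

        // "non-reuse" mode: a dedicated object for every record.
        Record next() {
            return fill(new Record());
        }

        // "reuse" mode: the caller-provided object is overwritten.
        Record next(Record reuse) {
            return fill(reuse);
        }

        private Record fill(Record target) {
            target.key = raw[pos][0];
            target.value = raw[pos][1];
            pos++;
            return target;
        }
    }

    // User code that materializes the whole group and is NOT aware of reuse.
    static List<Record> materialize(int[][] raw, boolean reuse) {
        RecordReader reader = new RecordReader(raw);
        Record holder = new Record();
        List<Record> group = new ArrayList<>();
        while (reader.hasNext()) {
            group.add(reuse ? reader.next(holder) : reader.next());
        }
        return group;
    }

    static int sumOfValues(List<Record> group) {
        int sum = 0;
        for (Record r : group) {
            sum += r.value;
        }
        return sum;
    }

    public static void main(String[] args) {
        int[][] raw = {{1, 10}, {1, 20}, {1, 30}};
        System.out.println(sumOfValues(materialize(raw, false)));
        System.out.println(sumOfValues(materialize(raw, true)));
    }
}
```

In the non-reuse case the materialized group holds three distinct records. In the reuse case it holds three references to one object, which is why user code that keeps records around has to copy them itself in that mode.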

Flink has multiple runtime backends that can execute a program. Some do not
support both modes:

- Flink's own runtime. Works in both "reuse" and "non-reuse" mode,
depending on what users select.

- Java Collections (non-parallel, for testing and lightweight embedding).
Tolerates object reuse in user code but does not attempt to reuse objects
itself.

- Tez (coming up) will in the long run also support both modes (reuse and
non-reuse).

Where the internal algorithms (sorting/hashing) do not affect any user
code behavior, they always work in "reusing" mode.

Greetings,
Stephan

On Sat, Feb 28, 2015 at 10:33 AM, Aljoscha Krettek <[hidden email]>
wrote:

> @Stephan, they are not copied when object reuse is enabled. This might
> be a problem, though, so maybe we should just change it in that place.

Ted Dunning

Re: Thoughts About Object Reuse and Collection Execution

On Mon, Mar 2, 2015 at 5:17 PM, Stephan Ewen <[hidden email]> wrote:

> There are two execution modes for the runtime: "reuse" and "non-reuse".

That makes a fair bit of sense.