(DEPRECATED) Apache Flink Mailing List archive.

how to split data-sets efficiently?

Martin Neumann

how to split data-sets efficiently?

Hej,

I have a dataset of StringIDs that I want to map to Longs using a
hash function. I will use the LongIDs in a series of iterative
computations and then map them back to StringIDs.
Currently I have a map operation that creates tuples of the string and
the long, and another mapper that strips out the strings.

Is there an operation that allows more than one output set
(basically splitting a set into 2 sets)? This would reduce the complexity of
the code a lot.
Also, how does the optimizer deal with this case? Does it fuse both map
operations together and actually run them as if they were a split?

cheers Martin
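
For readers following along, the two-mapper setup described above can be sketched in plain Java, without Flink. This is only an illustration: the thread does not name a hash function, so the 64-bit FNV-1a hash used here is an assumption, and the class and method names (`StringIdHashing`, `hashId`, `toPairs`, `toLongIds`) are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

final class StringIdHashing {
    // 64-bit FNV-1a. Assumption: the thread does not say which hash is used.
    static long hashId(String id) {
        long hash = 0xcbf29ce484222325L;           // FNV offset basis
        for (byte b : id.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);                    // mix in one byte
            hash *= 0x100000001b3L;                // multiply by the FNV prime
        }
        return hash;
    }

    // First map: StringID -> (LongID, StringID).
    static List<Map.Entry<Long, String>> toPairs(List<String> stringIds) {
        List<Map.Entry<Long, String>> pairs = new ArrayList<>();
        for (String id : stringIds) {
            pairs.add(new SimpleImmutableEntry<>(hashId(id), id));
        }
        return pairs;
    }

    // Second map: (LongID, StringID) -> LongID, dropping the string.
    static List<Long> toLongIds(List<Map.Entry<Long, String>> pairs) {
        List<Long> ids = new ArrayList<>();
        for (Map.Entry<Long, String> pair : pairs) {
            ids.add(pair.getKey());
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Map.Entry<Long, String>> pairs = toPairs(Arrays.asList("alice", "bob"));
        System.out.println(pairs);
        System.out.println(toLongIds(pairs));
    }
}
```

Note that any hash from strings to longs can collide, so mapping back to strings has to tolerate several strings sharing one long ID.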

Ufuk Celebi

Re: how to split data-sets efficiently?

Hey Martin,

On 27 Jul 2014, at 12:56, Martin Neumann <[hidden email]> wrote:

> Is there an operation that allows more than one output set
> (basically splitting a set into 2 sets)? This would reduce the complexity of
> the code a lot.

What exactly do you mean by split?

I am not sure if this is what you want, but you can just apply two transformations on the same input data set.

DataSet<String> input = ...;

DataSet<String> firstSet = input.map(...);

DataSet<String> secondSet = input.map(...);

Does this help?
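
As a rough illustration of the idea above (two independent transformations reading the same input), here is a plain-Java analogue built on `java.util.stream` rather than Flink's DataSet API. The class name, method names, and the two transformations are hypothetical and chosen only to make the sketch runnable.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class TwoMapsOneInput {
    // First transformation over the shared input: string -> its length.
    static List<Integer> lengths(List<String> input) {
        return input.stream().map(String::length).collect(Collectors.toList());
    }

    // Second transformation over the same input: string -> prefixed string.
    static List<String> prefixed(List<String> input) {
        return input.stream().map(s -> "id:" + s).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a", "bb", "ccc");
        // Both results are derived from the same input; neither consumes it.
        System.out.println(lengths(input));
        System.out.println(prefixed(input));
    }
}
```

The input is left untouched by either transformation, which is why it can feed any number of downstream results.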

Chesnay Schepler

Re: how to split data-sets efficiently?

I think this is what Martin is currently doing:

StringIDs --map-> (StringIDs,LongIDs) --map-> LongIDs

He wants to use both the second and third sets, and he is asking for a way
to replace the second map operation (since it seems unnecessary to create
an extra map for that).

I believe the appropriate way would be to use a projection instead of a
map operation. Something like:

mapped = stringIDs.map(...);
longids = mapped.project(1).types(Long.class);

You would end up with a Tuple1 set, though.

Stephan Ewen

Re: how to split data-sets efficiently?

Hi!

"Splitting", in the sense that one function returns two different data
sets, is currently not supported.

I guess you have to go with Ufuk's suggestion. In your case, I guess it
would look somewhat like this:

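// Map each string ID to a (long ID, string ID) pair.
DataSet<Tuple2<Long, String>> mapped = originalStrings.map(new HashIdMapper());

// Keep only the long IDs for the iterative computation.
DataSet<Long> ids = mapped.map(new ProjectTo2());

// Placeholder for the actual graph algorithm.
DataSet<Long> result = ids.runTheGraphAlgorithm(...);

// Join the results with the pairs to recover the string IDs.
result.join(mapped).where(...).equalTo(...).with(new MapBackToStrings());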

Greetings,
Stephan
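
The pipeline outlined above (map to pairs, extract the IDs, compute, join back) can be sketched end to end in plain Java. This is not Flink code: the join is emulated with a `HashMap`, `String.hashCode` stands in for the hash mapper, and `runAlgorithm` is a hypothetical placeholder that simply returns the distinct IDs in sorted order. All names are assumptions made for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

final class MapBackSketch {
    // Stand-in for the hash mapper (assumption: String.hashCode widened to long).
    static long toLongId(String id) {
        return id.hashCode();
    }

    // Plays the role of "mapped": long ID -> all string IDs that hash to it.
    static Map<Long, List<String>> buildMapping(List<String> stringIds) {
        Map<Long, List<String>> mapped = new HashMap<>();
        for (String id : stringIds) {
            mapped.computeIfAbsent(toLongId(id), k -> new ArrayList<>()).add(id);
        }
        return mapped;
    }

    // Hypothetical placeholder for the iterative computation on long IDs:
    // returns the distinct IDs in ascending order.
    static List<Long> runAlgorithm(Collection<Long> ids) {
        return new ArrayList<>(new TreeSet<>(ids));
    }

    // Equi-join of the result IDs with the mapping to recover string IDs.
    static List<String> mapBack(List<Long> resultIds, Map<Long, List<String>> mapped) {
        List<String> strings = new ArrayList<>();
        for (Long id : resultIds) {
            strings.addAll(mapped.getOrDefault(id, Collections.emptyList()));
        }
        return strings;
    }

    public static void main(String[] args) {
        Map<Long, List<String>> mapped = buildMapping(Arrays.asList("alice", "bob", "carol"));
        List<Long> result = runAlgorithm(mapped.keySet());
        System.out.println(mapBack(result, mapped));
    }
}
```

Keeping a list of strings per long ID means that colliding strings are all returned on the way back rather than silently overwritten.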

Stephan Ewen

Re: how to split data-sets efficiently?

Hey!

A similar issue has arisen in a different context. We should solve both
problems in a uniform way.

Can you participate in the discussion here:
https://issues.apache.org/jira/browse/FLINK-87

Greetings,
Stephan
