(DEPRECATED) Apache Flink Mailing List archive.

Outer-join operator integration with DataSet API (FLINK-2576)

Classic

List

Threaded

3 messages Options

Johann Kovacs

Outer-join operator integration with DataSet API (FLINK-2576)

Hi all,

we (Ricky and I) are currently working on the outer join
implementation for Flink (FLINK-687, previous pull requests #907,
#1052).

I am now looking for advice on 2 issues specifically regarding the
integration of the outer join operator with the DataSet API
(FLINK-2576).

1. There are several options of exposing the operator to the user via
the DataSet API and I'd just like to hear your preferences between the
following options (or other suggestions if I missed something):
a. DataSet#outerJoin(DataSet other, OuterJoinType outerJoinType)
[i.e. asking the user to pass an enum left-, right-, or full outer
join]
b. DataSet#join(DataSet other, JoinType joinType) [i.e. like option
a, but generalized to work for all: inner-, left-, right-, full outer
joins]
c. DataSet#left/right/fullOuterJoin(DataSet other) [i.e. a fully
qualified method for each operator]

Personally I'm partial towards options a and c, although a does have
the advantage of not blowing up the API too much (imagine adding
additional optional parameters, such as JoinHint, to each of option
c's methods).

2. I would have liked to implement the outer join operator API by
reusing as much code & functionality as possible from
org.apache.flink.api.java.operators.JoinOperator and JoinOperatorBase
(especially all the KeySelector, semantic annotations, and tuple
unwrapping stuff...) but I feel like this would bite me sooner or
later due to incompatibilities or other minor differences between the
behaviour of those operators.
I imagine this is the reason why lots of this functionality was
duplicated for the CoGroup operator implementation. Which makes me
think I should probably go the same route and duplicate the necessary
APIs, and then maybe try to refactor later?
Any opinions or hints regarding this?

Thanks in advance,
Johann

Till Rohrmann

Re: Outer-join operator integration with DataSet API (FLINK-2576)

Hi Johann,

I'd prefer 1.c, because the different join variants are semantically
different and this should be IMO reflected in the API. Moreover, the
`JoinHints` are used to give hints for the selection of the underlying
strategy for the different join variants. For `leftOuterJoin` you could
either use a sort-merge-join or a hash-join where the left side is the
probe side. For the `rightOuterJoin` you will have different strategies:
Either you use a sort-merge-join or a hash-join where the right side is the
probe side. Thus, the `JoinHints` for both variants cannot be the same.

Since I'm not too familiar with the code base of the join operator
implementation, I cannot give you a really good advice for question 2. In
general it would be good to avoid as much redundant code as possible since
it makes the code base harder to maintain. But if this should entail a
disproportional larger work effort, then I think it's also fine to follow
the CoGroup way.

Cheers,
Till

On Tue, Sep 1, 2015 at 6:02 PM, Johann Kovacs <[hidden email]> wrote:

> Hi all,
>
> we (Ricky and I) are currently working on the outer join
> implementation for Flink (FLINK-687, previous pull requests #907,
> #1052).
>
> I am now looking for advice on 2 issues specifically regarding the
> integration of the outer join operator with the DataSet API
> (FLINK-2576).
>
> 1. There are several options of exposing the operator to the user via
> the DataSet API and I'd just like to hear your preferences between the
> following options (or other suggestions if I missed something):
> a. DataSet#outerJoin(DataSet other, OuterJoinType outerJoinType)
> [i.e. asking the user to pass an enum left-, right-, or full outer
> join]
> b. DataSet#join(DataSet other, JoinType joinType) [i.e. like option
> a, but generalized to work for all: inner-, left-, right-, full outer
> joins]
> c. DataSet#left/right/fullOuterJoin(DataSet other) [i.e. a fully
> qualified method for each operator]
>
> Personally I'm partial towards options a and c, although a does have
> the advantage of not blowing up the API too much (imagine adding
> additional optional parameters, such as JoinHint, to each of option
> c's methods).
>
> 2. I would have liked to implement the outer join operator API by
> reusing as much code & functionality as possible from
> org.apache.flink.api.java.operators.JoinOperator and JoinOperatorBase
> (especially all the KeySelector, semantic annotations, and tuple
> unwrapping stuff...) but I feel like this would bite me sooner or
> later due to incompatibilities or other minor differences between the
> behaviour of those operators.
> I imagine this is the reason why lots of this functionality was
> duplicated for the CoGroup operator implementation. Which makes me
> think I should probably go the same route and duplicate the necessary
> APIs, and then maybe try to refactor later?
> Any opinions or hints regarding this?
>
> Thanks in advance,
> Johann
>

Fabian Hueske-2

Re: Outer-join operator integration with DataSet API (FLINK-2576)

Hi Johann, hi Ricky,

Thanks for reaching out to the mailing list before taking action!

I do also prefer option c.
In principle, all inner join strategies can also be applied for all outer
joins (for some hash strategies, a special HashTable implementation is
required).
I propose to add two methods for each join type, with and without JoinHint.
The JoinHint variant should fail for strategies which are not applicable,
yet (all hash-based).

You are right that there is quite a bit of duplicated code in the API code.
It could really use some refactoring and deduplication.
However, I see this as a separate issue and would also go for the separate
operator approach in the beginning.
Maybe you can implement a common operator base for all outer joins.

If you are interested, you can help refactoring the API as a follow up
issue ;-)

Cheers, Fabian

2015-09-01 18:34 GMT+02:00 Till Rohrmann <[hidden email]>:

> Hi Johann,
>
> I'd prefer 1.c, because the different join variants are semantically
> different and this should be IMO reflected in the API. Moreover, the
> `JoinHints` are used to give hints for the selection of the underlying
> strategy for the different join variants. For `leftOuterJoin` you could
> either use a sort-merge-join or a hash-join where the left side is the
> probe side. For the `rightOuterJoin` you will have different strategies:
> Either you use a sort-merge-join or a hash-join where the right side is the
> probe side. Thus, the `JoinHints` for both variants cannot be the same.
>
> Since I'm not too familiar with the code base of the join operator
> implementation, I cannot give you a really good advice for question 2. In
> general it would be good to avoid as much redundant code as possible since
> it makes the code base harder to maintain. But if this should entail a
> disproportional larger work effort, then I think it's also fine to follow
> the CoGroup way.
>
> Cheers,
> Till
>
> On Tue, Sep 1, 2015 at 6:02 PM, Johann Kovacs <[hidden email]> wrote:
>
> > Hi all,
> >
> > we (Ricky and I) are currently working on the outer join
> > implementation for Flink (FLINK-687, previous pull requests #907,
> > #1052).
> >
> > I am now looking for advice on 2 issues specifically regarding the
> > integration of the outer join operator with the DataSet API
> > (FLINK-2576).
> >
> > 1. There are several options of exposing the operator to the user via
> > the DataSet API and I'd just like to hear your preferences between the
> > following options (or other suggestions if I missed something):
> > a. DataSet#outerJoin(DataSet other, OuterJoinType outerJoinType)
> > [i.e. asking the user to pass an enum left-, right-, or full outer
> > join]
> > b. DataSet#join(DataSet other, JoinType joinType) [i.e. like option
> > a, but generalized to work for all: inner-, left-, right-, full outer
> > joins]
> > c. DataSet#left/right/fullOuterJoin(DataSet other) [i.e. a fully
> > qualified method for each operator]
> >
> > Personally I'm partial towards options a and c, although a does have
> > the advantage of not blowing up the API too much (imagine adding
> > additional optional parameters, such as JoinHint, to each of option
> > c's methods).
> >
> > 2. I would have liked to implement the outer join operator API by
> > reusing as much code & functionality as possible from
> > org.apache.flink.api.java.operators.JoinOperator and JoinOperatorBase
> > (especially all the KeySelector, semantic annotations, and tuple
> > unwrapping stuff...) but I feel like this would bite me sooner or
> > later due to incompatibilities or other minor differences between the
> > behaviour of those operators.
> > I imagine this is the reason why lots of this functionality was
> > duplicated for the CoGroup operator implementation. Which makes me
> > think I should probably go the same route and duplicate the necessary
> > APIs, and then maybe try to refactor later?
> > Any opinions or hints regarding this?
> >
> > Thanks in advance,
> > Johann
> >
>