Hi all,
we (Ricky and I) are currently working on the outer join implementation for Flink (FLINK-687, previous pull requests #907, #1052). I am now looking for advice on 2 issues specifically regarding the integration of the outer join operator with the DataSet API (FLINK-2576). 1. There are several options of exposing the operator to the user via the DataSet API and I'd just like to hear your preferences between the following options (or other suggestions if I missed something): a. DataSet#outerJoin(DataSet other, OuterJoinType outerJoinType) [i.e. asking the user to pass an enum left-, right-, or full outer join] b. DataSet#join(DataSet other, JoinType joinType) [i.e. like option a, but generalized to work for all: inner-, left-, right-, full outer joins] c. DataSet#left/right/fullOuterJoin(DataSet other) [i.e. a fully qualified method for each operator] Personally I'm partial towards options a and c, although a does have the advantage of not blowing up the API too much (imagine adding additional optional parameters, such as JoinHint, to each of option c's methods). 2. I would have liked to implement the outer join operator API by reusing as much code & functionality as possible from org.apache.flink.api.java.operators.JoinOperator and JoinOperatorBase (especially all the KeySelector, semantic annotations, and tuple unwrapping stuff...) but I feel like this would bite me sooner or later due to incompatibilities or other minor differences between the behaviour of those operators. I imagine this is the reason why lots of this functionality was duplicated for the CoGroup operator implementation. Which makes me think I should probably go the same route and duplicate the necessary APIs, and then maybe try to refactor later? Any opinions or hints regarding this? Thanks in advance, Johann |
Hi Johann,
I'd prefer 1.c, because the different join variants are semantically different and this should be IMO reflected in the API. Moreover, the `JoinHints` are used to give hints for the selection of the underlying strategy for the different join variants. For `leftOuterJoin` you could either use a sort-merge-join or a hash-join where the left side is the probe side. For the `rightOuterJoin` you will have different strategies: Either you use a sort-merge-join or a hash-join where the right side is the probe side. Thus, the `JoinHints` for both variants cannot be the same. Since I'm not too familiar with the code base of the join operator implementation, I cannot give you a really good advice for question 2. In general it would be good to avoid as much redundant code as possible since it makes the code base harder to maintain. But if this should entail a disproportional larger work effort, then I think it's also fine to follow the CoGroup way. Cheers, Till On Tue, Sep 1, 2015 at 6:02 PM, Johann Kovacs <[hidden email]> wrote: > Hi all, > > we (Ricky and I) are currently working on the outer join > implementation for Flink (FLINK-687, previous pull requests #907, > #1052). > > I am now looking for advice on 2 issues specifically regarding the > integration of the outer join operator with the DataSet API > (FLINK-2576). > > 1. There are several options of exposing the operator to the user via > the DataSet API and I'd just like to hear your preferences between the > following options (or other suggestions if I missed something): > a. DataSet#outerJoin(DataSet other, OuterJoinType outerJoinType) > [i.e. asking the user to pass an enum left-, right-, or full outer > join] > b. DataSet#join(DataSet other, JoinType joinType) [i.e. like option > a, but generalized to work for all: inner-, left-, right-, full outer > joins] > c. DataSet#left/right/fullOuterJoin(DataSet other) [i.e. a fully > qualified method for each operator] > > Personally I'm partial towards options a and c, although a does have > the advantage of not blowing up the API too much (imagine adding > additional optional parameters, such as JoinHint, to each of option > c's methods). > > 2. I would have liked to implement the outer join operator API by > reusing as much code & functionality as possible from > org.apache.flink.api.java.operators.JoinOperator and JoinOperatorBase > (especially all the KeySelector, semantic annotations, and tuple > unwrapping stuff...) but I feel like this would bite me sooner or > later due to incompatibilities or other minor differences between the > behaviour of those operators. > I imagine this is the reason why lots of this functionality was > duplicated for the CoGroup operator implementation. Which makes me > think I should probably go the same route and duplicate the necessary > APIs, and then maybe try to refactor later? > Any opinions or hints regarding this? > > Thanks in advance, > Johann > |
Hi Johann, hi Ricky,
Thanks for reaching out to the mailing list before taking action! I do also prefer option c. In principle, all inner join strategies can also be applied for all outer joins (for some hash strategies, a special HashTable implementation is required). I propose to add two methods for each join type, with and without JoinHint. The JoinHint variant should fail for strategies which are not applicable, yet (all hash-based). You are right that there is quite a bit of duplicated code in the API code. It could really use some refactoring and deduplication. However, I see this as a separate issue and would also go for the separate operator approach in the beginning. Maybe you can implement a common operator base for all outer joins. If you are interested, you can help refactoring the API as a follow up issue ;-) Cheers, Fabian 2015-09-01 18:34 GMT+02:00 Till Rohrmann <[hidden email]>: > Hi Johann, > > I'd prefer 1.c, because the different join variants are semantically > different and this should be IMO reflected in the API. Moreover, the > `JoinHints` are used to give hints for the selection of the underlying > strategy for the different join variants. For `leftOuterJoin` you could > either use a sort-merge-join or a hash-join where the left side is the > probe side. For the `rightOuterJoin` you will have different strategies: > Either you use a sort-merge-join or a hash-join where the right side is the > probe side. Thus, the `JoinHints` for both variants cannot be the same. > > Since I'm not too familiar with the code base of the join operator > implementation, I cannot give you a really good advice for question 2. In > general it would be good to avoid as much redundant code as possible since > it makes the code base harder to maintain. But if this should entail a > disproportional larger work effort, then I think it's also fine to follow > the CoGroup way. > > Cheers, > Till > > On Tue, Sep 1, 2015 at 6:02 PM, Johann Kovacs <[hidden email]> wrote: > > > Hi all, > > > > we (Ricky and I) are currently working on the outer join > > implementation for Flink (FLINK-687, previous pull requests #907, > > #1052). > > > > I am now looking for advice on 2 issues specifically regarding the > > integration of the outer join operator with the DataSet API > > (FLINK-2576). > > > > 1. There are several options of exposing the operator to the user via > > the DataSet API and I'd just like to hear your preferences between the > > following options (or other suggestions if I missed something): > > a. DataSet#outerJoin(DataSet other, OuterJoinType outerJoinType) > > [i.e. asking the user to pass an enum left-, right-, or full outer > > join] > > b. DataSet#join(DataSet other, JoinType joinType) [i.e. like option > > a, but generalized to work for all: inner-, left-, right-, full outer > > joins] > > c. DataSet#left/right/fullOuterJoin(DataSet other) [i.e. a fully > > qualified method for each operator] > > > > Personally I'm partial towards options a and c, although a does have > > the advantage of not blowing up the API too much (imagine adding > > additional optional parameters, such as JoinHint, to each of option > > c's methods). > > > > 2. I would have liked to implement the outer join operator API by > > reusing as much code & functionality as possible from > > org.apache.flink.api.java.operators.JoinOperator and JoinOperatorBase > > (especially all the KeySelector, semantic annotations, and tuple > > unwrapping stuff...) but I feel like this would bite me sooner or > > later due to incompatibilities or other minor differences between the > > behaviour of those operators. > > I imagine this is the reason why lots of this functionality was > > duplicated for the CoGroup operator implementation. Which makes me > > think I should probably go the same route and duplicate the necessary > > APIs, and then maybe try to refactor later? > > Any opinions or hints regarding this? > > > > Thanks in advance, > > Johann > > > |
Free forum by Nabble | Edit this page |