Hi Guys,
while porting the Java API to Scala I'm noticing how complicated things are because of how our TypeComparators work: 1) There is only one type of comparator per TypeInformation which is created by the TypeInformation. Therefore, our KeySelectors are not actually implemented as comparators but as generated mappers that emit a Tuple2, because you wouldn't for example be able to generate a SelectorFunctionComparator for a TupleTypeInfo. (There's also a lot of magic going on with wrapping and unwrapping those tuples in Reduce, Join, and CoGroup.) 2) Comparators cannot really interoperate, there is special case code for the combinations that work. This will only get worse when we properly introduce POJO types, which should work well with tuple comparators and the other comparators. My proposal is this: No more TypeComparator on a per type basis. Just a generic comparator and PairComparator that work on Comparable. What used to be TypeComparators become SelectionExtractors that return a Comparable. Make Tuple comparable or add new ComparableTuple. The TupleSelectionExtractor would return a comparable tuple of the appropriate length (same for POJOs). For Tuple extractors that operate on only one field they would immediately return that field, without wrapping it in a tuple. This would directly support our existing KeySelector functions since the already return Comparable, when returning a tuple in a key selector function this would be compatible with a TupleSelectionExtractor (on the other join side, for example). That's my idea. What do you think? I think the current state is not maintainable, so we should do something quickly. :D Cheers, Aljoscha |
The design of the comparators so far was to make them work on the binary
data. That we need to retain, in my opinion, otherwise there is no way to get good performance out of working on serialized data. I personally think that creating a tuple2 (key/value pair) when using selector functions is actually good: The key type (being treated by its dedicated comparator) benefits from all the optimizations implemented for that type (bin copying, normalized keys, ...) That would be very hard to incorporate into any comparator that just deserializes some comparable. Also, the key extractor can contain sort of heavy magic (such as to block keys), whatever a user put in there. If we put that into the comparator, it gets called for every comparison! I do agree, though, that we need to come up with a better interface that seamlessly allows working on binary versions and on objects, without duplicating too much code. From your suggestion, I am not sure I got everything. Could you post a concrete example or code? Stephan On Wed, Aug 27, 2014 at 5:02 PM, Aljoscha Krettek <[hidden email]> wrote: > Hi Guys, > while porting the Java API to Scala I'm noticing how complicated > things are because of how our TypeComparators work: 1) There is only > one type of comparator per TypeInformation which is created by the > TypeInformation. Therefore, our KeySelectors are not actually > implemented as comparators but as generated mappers that emit a > Tuple2, because you wouldn't for example be able to generate a > SelectorFunctionComparator for a TupleTypeInfo. (There's also a lot > of magic going on with wrapping and unwrapping those tuples in Reduce, > Join, and CoGroup.) 2) Comparators cannot really interoperate, there > is special case code for the combinations that work. This will only > get worse when we properly introduce POJO types, which should work > well with tuple comparators and the other comparators. > > My proposal is this: No more TypeComparator on a per type basis. Just > a generic comparator and PairComparator that work on Comparable. What > used to be TypeComparators become SelectionExtractors that return a > Comparable. Make Tuple comparable or add new ComparableTuple. The > TupleSelectionExtractor would return a comparable tuple of the > appropriate length (same for POJOs). For Tuple extractors that operate > on only one field they would immediately return that field, without > wrapping it in a tuple. This would directly support our existing > KeySelector functions since the already return Comparable, when > returning a tuple in a key selector function this would be compatible > with a TupleSelectionExtractor (on the other join side, for example). > > That's my idea. What do you think? I think the current state is not > maintainable, so we should do something quickly. :D > > Cheers, > Aljoscha > |
It would work on binary data, for example for tuples it would become:
public Comparable extractKeys(DataInputView in) { Object[] fields = ... // extract only relevant fields from binary input and save in fields return Tuple(fields) // something like that } And for normalized keys something similar can be done. Aljoscha On Wed, Aug 27, 2014 at 8:39 PM, Stephan Ewen <[hidden email]> wrote: > The design of the comparators so far was to make them work on the binary > data. That we need to retain, in my opinion, otherwise there is no > way to get good performance out of working on serialized data. > > I personally think that creating a tuple2 (key/value pair) when using > selector functions is actually good: > The key type (being treated by its dedicated comparator) benefits from all > the optimizations implemented for that type (bin copying, normalized keys, > ...) > That would be very hard to incorporate into any comparator that just > deserializes some comparable. > > Also, the key extractor can contain sort of heavy magic (such as to block > keys), whatever a user put in there. If we put that into the comparator, it > gets called for > every comparison! > > I do agree, though, that we need to come up with a better interface that > seamlessly allows working on binary versions and on objects, without > duplicating too much code. > > From your suggestion, I am not sure I got everything. Could you post a > concrete example or code? > > Stephan > > > > > On Wed, Aug 27, 2014 at 5:02 PM, Aljoscha Krettek <[hidden email]> > wrote: > >> Hi Guys, >> while porting the Java API to Scala I'm noticing how complicated >> things are because of how our TypeComparators work: 1) There is only >> one type of comparator per TypeInformation which is created by the >> TypeInformation. Therefore, our KeySelectors are not actually >> implemented as comparators but as generated mappers that emit a >> Tuple2, because you wouldn't for example be able to generate a >> SelectorFunctionComparator for a TupleTypeInfo. (There's also a lot >> of magic going on with wrapping and unwrapping those tuples in Reduce, >> Join, and CoGroup.) 2) Comparators cannot really interoperate, there >> is special case code for the combinations that work. This will only >> get worse when we properly introduce POJO types, which should work >> well with tuple comparators and the other comparators. >> >> My proposal is this: No more TypeComparator on a per type basis. Just >> a generic comparator and PairComparator that work on Comparable. What >> used to be TypeComparators become SelectionExtractors that return a >> Comparable. Make Tuple comparable or add new ComparableTuple. The >> TupleSelectionExtractor would return a comparable tuple of the >> appropriate length (same for POJOs). For Tuple extractors that operate >> on only one field they would immediately return that field, without >> wrapping it in a tuple. This would directly support our existing >> KeySelector functions since the already return Comparable, when >> returning a tuple in a key selector function this would be compatible >> with a TupleSelectionExtractor (on the other join side, for example). >> >> That's my idea. What do you think? I think the current state is not >> maintainable, so we should do something quickly. :D >> >> Cheers, >> Aljoscha >> |
A lot of comparisons (and that is what makes it fast) can happen on the
binary data, without turning them into a comparable. The latest pull request in the String representation for example improved String-keyed sorting quite a bit: https://github.com/apache/incubator-flink/pull/4 On Wed, Aug 27, 2014 at 9:00 PM, Aljoscha Krettek <[hidden email]> wrote: > It would work on binary data, for example for tuples it would become: > > public Comparable extractKeys(DataInputView in) { > Object[] fields = ... > // extract only relevant fields from binary input and save in fields > return Tuple(fields) // something like that > } > > And for normalized keys something similar can be done. > > Aljoscha > > On Wed, Aug 27, 2014 at 8:39 PM, Stephan Ewen <[hidden email]> wrote: > > The design of the comparators so far was to make them work on the binary > > data. That we need to retain, in my opinion, otherwise there is no > > way to get good performance out of working on serialized data. > > > > I personally think that creating a tuple2 (key/value pair) when using > > selector functions is actually good: > > The key type (being treated by its dedicated comparator) benefits from > all > > the optimizations implemented for that type (bin copying, normalized > keys, > > ...) > > That would be very hard to incorporate into any comparator that just > > deserializes some comparable. > > > > Also, the key extractor can contain sort of heavy magic (such as to block > > keys), whatever a user put in there. If we put that into the comparator, > it > > gets called for > > every comparison! > > > > I do agree, though, that we need to come up with a better interface that > > seamlessly allows working on binary versions and on objects, without > > duplicating too much code. > > > > From your suggestion, I am not sure I got everything. Could you post a > > concrete example or code? > > > > Stephan > > > > > > > > > > On Wed, Aug 27, 2014 at 5:02 PM, Aljoscha Krettek <[hidden email]> > > wrote: > > > >> Hi Guys, > >> while porting the Java API to Scala I'm noticing how complicated > >> things are because of how our TypeComparators work: 1) There is only > >> one type of comparator per TypeInformation which is created by the > >> TypeInformation. Therefore, our KeySelectors are not actually > >> implemented as comparators but as generated mappers that emit a > >> Tuple2, because you wouldn't for example be able to generate a > >> SelectorFunctionComparator for a TupleTypeInfo. (There's also a lot > >> of magic going on with wrapping and unwrapping those tuples in Reduce, > >> Join, and CoGroup.) 2) Comparators cannot really interoperate, there > >> is special case code for the combinations that work. This will only > >> get worse when we properly introduce POJO types, which should work > >> well with tuple comparators and the other comparators. > >> > >> My proposal is this: No more TypeComparator on a per type basis. Just > >> a generic comparator and PairComparator that work on Comparable. What > >> used to be TypeComparators become SelectionExtractors that return a > >> Comparable. Make Tuple comparable or add new ComparableTuple. The > >> TupleSelectionExtractor would return a comparable tuple of the > >> appropriate length (same for POJOs). For Tuple extractors that operate > >> on only one field they would immediately return that field, without > >> wrapping it in a tuple. This would directly support our existing > >> KeySelector functions since the already return Comparable, when > >> returning a tuple in a key selector function this would be compatible > >> with a TupleSelectionExtractor (on the other join side, for example). > >> > >> That's my idea. What do you think? I think the current state is not > >> maintainable, so we should do something quickly. :D > >> > >> Cheers, > >> Aljoscha > >> > |
Ok, then we have to find some other solution for interop between
comparators. What's the performance increase from that pull request? On Wed, Aug 27, 2014 at 9:24 PM, Stephan Ewen <[hidden email]> wrote: > A lot of comparisons (and that is what makes it fast) can happen on the > binary data, without turning them into a comparable. > > The latest pull request in the String representation for example improved > String-keyed sorting quite a bit: > > https://github.com/apache/incubator-flink/pull/4 > > > > > > On Wed, Aug 27, 2014 at 9:00 PM, Aljoscha Krettek <[hidden email]> > wrote: > >> It would work on binary data, for example for tuples it would become: >> >> public Comparable extractKeys(DataInputView in) { >> Object[] fields = ... >> // extract only relevant fields from binary input and save in fields >> return Tuple(fields) // something like that >> } >> >> And for normalized keys something similar can be done. >> >> Aljoscha >> >> On Wed, Aug 27, 2014 at 8:39 PM, Stephan Ewen <[hidden email]> wrote: >> > The design of the comparators so far was to make them work on the binary >> > data. That we need to retain, in my opinion, otherwise there is no >> > way to get good performance out of working on serialized data. >> > >> > I personally think that creating a tuple2 (key/value pair) when using >> > selector functions is actually good: >> > The key type (being treated by its dedicated comparator) benefits from >> all >> > the optimizations implemented for that type (bin copying, normalized >> keys, >> > ...) >> > That would be very hard to incorporate into any comparator that just >> > deserializes some comparable. >> > >> > Also, the key extractor can contain sort of heavy magic (such as to block >> > keys), whatever a user put in there. If we put that into the comparator, >> it >> > gets called for >> > every comparison! >> > >> > I do agree, though, that we need to come up with a better interface that >> > seamlessly allows working on binary versions and on objects, without >> > duplicating too much code. >> > >> > From your suggestion, I am not sure I got everything. Could you post a >> > concrete example or code? >> > >> > Stephan >> > >> > >> > >> > >> > On Wed, Aug 27, 2014 at 5:02 PM, Aljoscha Krettek <[hidden email]> >> > wrote: >> > >> >> Hi Guys, >> >> while porting the Java API to Scala I'm noticing how complicated >> >> things are because of how our TypeComparators work: 1) There is only >> >> one type of comparator per TypeInformation which is created by the >> >> TypeInformation. Therefore, our KeySelectors are not actually >> >> implemented as comparators but as generated mappers that emit a >> >> Tuple2, because you wouldn't for example be able to generate a >> >> SelectorFunctionComparator for a TupleTypeInfo. (There's also a lot >> >> of magic going on with wrapping and unwrapping those tuples in Reduce, >> >> Join, and CoGroup.) 2) Comparators cannot really interoperate, there >> >> is special case code for the combinations that work. This will only >> >> get worse when we properly introduce POJO types, which should work >> >> well with tuple comparators and the other comparators. >> >> >> >> My proposal is this: No more TypeComparator on a per type basis. Just >> >> a generic comparator and PairComparator that work on Comparable. What >> >> used to be TypeComparators become SelectionExtractors that return a >> >> Comparable. Make Tuple comparable or add new ComparableTuple. The >> >> TupleSelectionExtractor would return a comparable tuple of the >> >> appropriate length (same for POJOs). For Tuple extractors that operate >> >> on only one field they would immediately return that field, without >> >> wrapping it in a tuple. This would directly support our existing >> >> KeySelector functions since the already return Comparable, when >> >> returning a tuple in a key selector function this would be compatible >> >> with a TupleSelectionExtractor (on the other join side, for example). >> >> >> >> That's my idea. What do you think? I think the current state is not >> >> maintainable, so we should do something quickly. :D >> >> >> >> Cheers, >> >> Aljoscha >> >> >> |
i cant recall definitely what the numbers were so I'll just quote myself
from the PR: measurements were done using System.nanoTime() time necessary for the comparison Strings consisted of 90 characters difference in the beginning of the string New 4259 Old 23431 difference in the middle of the string New 13560 Old 30757 difference at the end of the string New 29385 Old 28293 for smaller strings (<10 i think) it was roughly 50% faster. On 27.8.2014 22:10, Aljoscha Krettek wrote: > Ok, then we have to find some other solution for interop between > comparators. What's the performance increase from that pull request? > > On Wed, Aug 27, 2014 at 9:24 PM, Stephan Ewen <[hidden email]> wrote: >> A lot of comparisons (and that is what makes it fast) can happen on the >> binary data, without turning them into a comparable. >> >> The latest pull request in the String representation for example improved >> String-keyed sorting quite a bit: >> >> https://github.com/apache/incubator-flink/pull/4 >> >> >> >> >> >> On Wed, Aug 27, 2014 at 9:00 PM, Aljoscha Krettek <[hidden email]> >> wrote: >> >>> It would work on binary data, for example for tuples it would become: >>> >>> public Comparable extractKeys(DataInputView in) { >>> Object[] fields = ... >>> // extract only relevant fields from binary input and save in fields >>> return Tuple(fields) // something like that >>> } >>> >>> And for normalized keys something similar can be done. >>> >>> Aljoscha >>> >>> On Wed, Aug 27, 2014 at 8:39 PM, Stephan Ewen <[hidden email]> wrote: >>>> The design of the comparators so far was to make them work on the binary >>>> data. That we need to retain, in my opinion, otherwise there is no >>>> way to get good performance out of working on serialized data. >>>> >>>> I personally think that creating a tuple2 (key/value pair) when using >>>> selector functions is actually good: >>>> The key type (being treated by its dedicated comparator) benefits from >>> all >>>> the optimizations implemented for that type (bin copying, normalized >>> keys, >>>> ...) >>>> That would be very hard to incorporate into any comparator that just >>>> deserializes some comparable. >>>> >>>> Also, the key extractor can contain sort of heavy magic (such as to block >>>> keys), whatever a user put in there. If we put that into the comparator, >>> it >>>> gets called for >>>> every comparison! >>>> >>>> I do agree, though, that we need to come up with a better interface that >>>> seamlessly allows working on binary versions and on objects, without >>>> duplicating too much code. >>>> >>>> From your suggestion, I am not sure I got everything. Could you post a >>>> concrete example or code? >>>> >>>> Stephan >>>> >>>> >>>> >>>> >>>> On Wed, Aug 27, 2014 at 5:02 PM, Aljoscha Krettek <[hidden email]> >>>> wrote: >>>> >>>>> Hi Guys, >>>>> while porting the Java API to Scala I'm noticing how complicated >>>>> things are because of how our TypeComparators work: 1) There is only >>>>> one type of comparator per TypeInformation which is created by the >>>>> TypeInformation. Therefore, our KeySelectors are not actually >>>>> implemented as comparators but as generated mappers that emit a >>>>> Tuple2, because you wouldn't for example be able to generate a >>>>> SelectorFunctionComparator for a TupleTypeInfo. (There's also a lot >>>>> of magic going on with wrapping and unwrapping those tuples in Reduce, >>>>> Join, and CoGroup.) 2) Comparators cannot really interoperate, there >>>>> is special case code for the combinations that work. This will only >>>>> get worse when we properly introduce POJO types, which should work >>>>> well with tuple comparators and the other comparators. >>>>> >>>>> My proposal is this: No more TypeComparator on a per type basis. Just >>>>> a generic comparator and PairComparator that work on Comparable. What >>>>> used to be TypeComparators become SelectionExtractors that return a >>>>> Comparable. Make Tuple comparable or add new ComparableTuple. The >>>>> TupleSelectionExtractor would return a comparable tuple of the >>>>> appropriate length (same for POJOs). For Tuple extractors that operate >>>>> on only one field they would immediately return that field, without >>>>> wrapping it in a tuple. This would directly support our existing >>>>> KeySelector functions since the already return Comparable, when >>>>> returning a tuple in a key selector function this would be compatible >>>>> with a TupleSelectionExtractor (on the other join side, for example). >>>>> >>>>> That's my idea. What do you think? I think the current state is not >>>>> maintainable, so we should do something quickly. :D >>>>> >>>>> Cheers, >>>>> Aljoscha >>>>> |
Ok, but that's micro benchmarks. How much time is really spent on
those comparisons when a real job is executed? Does anyone know? Also, the tuple pair comparators never work on binary representation, right? On Wed, Aug 27, 2014 at 10:19 PM, Chesnay Schepler <[hidden email]> wrote: > i cant recall definitely what the numbers were so I'll just quote myself > from the PR: > > measurements were done using System.nanoTime() > time necessary for the comparison > Strings consisted of 90 characters > > difference in the beginning of the string > New 4259 > Old 23431 > > difference in the middle of the string > New 13560 > Old 30757 > > difference at the end of the string > New 29385 > Old 28293 > > for smaller strings (<10 i think) it was roughly 50% faster. > > > > On 27.8.2014 22:10, Aljoscha Krettek wrote: >> >> Ok, then we have to find some other solution for interop between >> comparators. What's the performance increase from that pull request? >> >> On Wed, Aug 27, 2014 at 9:24 PM, Stephan Ewen <[hidden email]> wrote: >>> >>> A lot of comparisons (and that is what makes it fast) can happen on the >>> binary data, without turning them into a comparable. >>> >>> The latest pull request in the String representation for example improved >>> String-keyed sorting quite a bit: >>> >>> https://github.com/apache/incubator-flink/pull/4 >>> >>> >>> >>> >>> >>> On Wed, Aug 27, 2014 at 9:00 PM, Aljoscha Krettek <[hidden email]> >>> wrote: >>> >>>> It would work on binary data, for example for tuples it would become: >>>> >>>> public Comparable extractKeys(DataInputView in) { >>>> Object[] fields = ... >>>> // extract only relevant fields from binary input and save in >>>> fields >>>> return Tuple(fields) // something like that >>>> } >>>> >>>> And for normalized keys something similar can be done. >>>> >>>> Aljoscha >>>> >>>> On Wed, Aug 27, 2014 at 8:39 PM, Stephan Ewen <[hidden email]> wrote: >>>>> >>>>> The design of the comparators so far was to make them work on the >>>>> binary >>>>> data. That we need to retain, in my opinion, otherwise there is no >>>>> way to get good performance out of working on serialized data. >>>>> >>>>> I personally think that creating a tuple2 (key/value pair) when using >>>>> selector functions is actually good: >>>>> The key type (being treated by its dedicated comparator) benefits from >>>> >>>> all >>>>> >>>>> the optimizations implemented for that type (bin copying, normalized >>>> >>>> keys, >>>>> >>>>> ...) >>>>> That would be very hard to incorporate into any comparator that just >>>>> deserializes some comparable. >>>>> >>>>> Also, the key extractor can contain sort of heavy magic (such as to >>>>> block >>>>> keys), whatever a user put in there. If we put that into the >>>>> comparator, >>>> >>>> it >>>>> >>>>> gets called for >>>>> every comparison! >>>>> >>>>> I do agree, though, that we need to come up with a better interface >>>>> that >>>>> seamlessly allows working on binary versions and on objects, without >>>>> duplicating too much code. >>>>> >>>>> From your suggestion, I am not sure I got everything. Could you post a >>>>> concrete example or code? >>>>> >>>>> Stephan >>>>> >>>>> >>>>> >>>>> >>>>> On Wed, Aug 27, 2014 at 5:02 PM, Aljoscha Krettek <[hidden email]> >>>>> wrote: >>>>> >>>>>> Hi Guys, >>>>>> while porting the Java API to Scala I'm noticing how complicated >>>>>> things are because of how our TypeComparators work: 1) There is only >>>>>> one type of comparator per TypeInformation which is created by the >>>>>> TypeInformation. Therefore, our KeySelectors are not actually >>>>>> implemented as comparators but as generated mappers that emit a >>>>>> Tuple2, because you wouldn't for example be able to generate a >>>>>> SelectorFunctionComparator for a TupleTypeInfo. (There's also a lot >>>>>> of magic going on with wrapping and unwrapping those tuples in Reduce, >>>>>> Join, and CoGroup.) 2) Comparators cannot really interoperate, there >>>>>> is special case code for the combinations that work. This will only >>>>>> get worse when we properly introduce POJO types, which should work >>>>>> well with tuple comparators and the other comparators. >>>>>> >>>>>> My proposal is this: No more TypeComparator on a per type basis. Just >>>>>> a generic comparator and PairComparator that work on Comparable. What >>>>>> used to be TypeComparators become SelectionExtractors that return a >>>>>> Comparable. Make Tuple comparable or add new ComparableTuple. The >>>>>> TupleSelectionExtractor would return a comparable tuple of the >>>>>> appropriate length (same for POJOs). For Tuple extractors that operate >>>>>> on only one field they would immediately return that field, without >>>>>> wrapping it in a tuple. This would directly support our existing >>>>>> KeySelector functions since the already return Comparable, when >>>>>> returning a tuple in a key selector function this would be compatible >>>>>> with a TupleSelectionExtractor (on the other join side, for example). >>>>>> >>>>>> That's my idea. What do you think? I think the current state is not >>>>>> maintainable, so we should do something quickly. :D >>>>>> >>>>>> Cheers, >>>>>> Aljoscha >>>>>> > |
I agree with Aljoscha that we should have further numbers on this
(independently of what we settle on with the comparators). @Chesnay: do you feel like giving it a try or is the Python mystery eating up all your time? ;-) On Wed, Aug 27, 2014 at 10:27 PM, Aljoscha Krettek <[hidden email]> wrote: > Ok, but that's micro benchmarks. How much time is really spent on > those comparisons when a real job is executed? Does anyone know? Also, > the tuple pair comparators never work on binary representation, right? > > On Wed, Aug 27, 2014 at 10:19 PM, Chesnay Schepler > <[hidden email]> wrote: > > i cant recall definitely what the numbers were so I'll just quote myself > > from the PR: > > > > measurements were done using System.nanoTime() > > time necessary for the comparison > > Strings consisted of 90 characters > > > > difference in the beginning of the string > > New 4259 > > Old 23431 > > > > difference in the middle of the string > > New 13560 > > Old 30757 > > > > difference at the end of the string > > New 29385 > > Old 28293 > > > > for smaller strings (<10 i think) it was roughly 50% faster. > > > > > > > > On 27.8.2014 22:10, Aljoscha Krettek wrote: > >> > >> Ok, then we have to find some other solution for interop between > >> comparators. What's the performance increase from that pull request? > >> > >> On Wed, Aug 27, 2014 at 9:24 PM, Stephan Ewen <[hidden email]> wrote: > >>> > >>> A lot of comparisons (and that is what makes it fast) can happen on the > >>> binary data, without turning them into a comparable. > >>> > >>> The latest pull request in the String representation for example > improved > >>> String-keyed sorting quite a bit: > >>> > >>> https://github.com/apache/incubator-flink/pull/4 > >>> > >>> > >>> > >>> > >>> > >>> On Wed, Aug 27, 2014 at 9:00 PM, Aljoscha Krettek <[hidden email] > > > >>> wrote: > >>> > >>>> It would work on binary data, for example for tuples it would become: > >>>> > >>>> public Comparable extractKeys(DataInputView in) { > >>>> Object[] fields = ... > >>>> // extract only relevant fields from binary input and save in > >>>> fields > >>>> return Tuple(fields) // something like that > >>>> } > >>>> > >>>> And for normalized keys something similar can be done. > >>>> > >>>> Aljoscha > >>>> > >>>> On Wed, Aug 27, 2014 at 8:39 PM, Stephan Ewen <[hidden email]> > wrote: > >>>>> > >>>>> The design of the comparators so far was to make them work on the > >>>>> binary > >>>>> data. That we need to retain, in my opinion, otherwise there is no > >>>>> way to get good performance out of working on serialized data. > >>>>> > >>>>> I personally think that creating a tuple2 (key/value pair) when using > >>>>> selector functions is actually good: > >>>>> The key type (being treated by its dedicated comparator) benefits > from > >>>> > >>>> all > >>>>> > >>>>> the optimizations implemented for that type (bin copying, normalized > >>>> > >>>> keys, > >>>>> > >>>>> ...) > >>>>> That would be very hard to incorporate into any comparator that just > >>>>> deserializes some comparable. > >>>>> > >>>>> Also, the key extractor can contain sort of heavy magic (such as to > >>>>> block > >>>>> keys), whatever a user put in there. If we put that into the > >>>>> comparator, > >>>> > >>>> it > >>>>> > >>>>> gets called for > >>>>> every comparison! > >>>>> > >>>>> I do agree, though, that we need to come up with a better interface > >>>>> that > >>>>> seamlessly allows working on binary versions and on objects, without > >>>>> duplicating too much code. > >>>>> > >>>>> From your suggestion, I am not sure I got everything. Could you > post a > >>>>> concrete example or code? > >>>>> > >>>>> Stephan > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> On Wed, Aug 27, 2014 at 5:02 PM, Aljoscha Krettek < > [hidden email]> > >>>>> wrote: > >>>>> > >>>>>> Hi Guys, > >>>>>> while porting the Java API to Scala I'm noticing how complicated > >>>>>> things are because of how our TypeComparators work: 1) There is only > >>>>>> one type of comparator per TypeInformation which is created by the > >>>>>> TypeInformation. Therefore, our KeySelectors are not actually > >>>>>> implemented as comparators but as generated mappers that emit a > >>>>>> Tuple2, because you wouldn't for example be able to generate a > >>>>>> SelectorFunctionComparator for a TupleTypeInfo. (There's also a lot > >>>>>> of magic going on with wrapping and unwrapping those tuples in > Reduce, > >>>>>> Join, and CoGroup.) 2) Comparators cannot really interoperate, there > >>>>>> is special case code for the combinations that work. This will only > >>>>>> get worse when we properly introduce POJO types, which should work > >>>>>> well with tuple comparators and the other comparators. > >>>>>> > >>>>>> My proposal is this: No more TypeComparator on a per type basis. > Just > >>>>>> a generic comparator and PairComparator that work on Comparable. > What > >>>>>> used to be TypeComparators become SelectionExtractors that return a > >>>>>> Comparable. Make Tuple comparable or add new ComparableTuple. The > >>>>>> TupleSelectionExtractor would return a comparable tuple of the > >>>>>> appropriate length (same for POJOs). For Tuple extractors that > operate > >>>>>> on only one field they would immediately return that field, without > >>>>>> wrapping it in a tuple. This would directly support our existing > >>>>>> KeySelector functions since the already return Comparable, when > >>>>>> returning a tuple in a key selector function this would be > compatible > >>>>>> with a TupleSelectionExtractor (on the other join side, for > example). > >>>>>> > >>>>>> That's my idea. What do you think? I think the current state is not > >>>>>> maintainable, so we should do something quickly. :D > >>>>>> > >>>>>> Cheers, > >>>>>> Aljoscha > >>>>>> > > > |
@aljoscha : yes, the tuple comparators don't work on binary data.
On 28.8.2014 0:10, Ufuk Celebi wrote: > I agree with Aljoscha that we should have further numbers on this > (independently of what we settle on with the comparators). > > @Chesnay: do you feel like giving it a try or is the Python mystery eating > up all your time? ;-) > > > On Wed, Aug 27, 2014 at 10:27 PM, Aljoscha Krettek <[hidden email]> > wrote: > >> Ok, but that's micro benchmarks. How much time is really spent on >> those comparisons when a real job is executed? Does anyone know? Also, >> the tuple pair comparators never work on binary representation, right? >> >> On Wed, Aug 27, 2014 at 10:19 PM, Chesnay Schepler >> <[hidden email]> wrote: >>> i cant recall definitely what the numbers were so I'll just quote myself >>> from the PR: >>> >>> measurements were done using System.nanoTime() >>> time necessary for the comparison >>> Strings consisted of 90 characters >>> >>> difference in the beginning of the string >>> New 4259 >>> Old 23431 >>> >>> difference in the middle of the string >>> New 13560 >>> Old 30757 >>> >>> difference at the end of the string >>> New 29385 >>> Old 28293 >>> >>> for smaller strings (<10 i think) it was roughly 50% faster. >>> >>> >>> >>> On 27.8.2014 22:10, Aljoscha Krettek wrote: >>>> Ok, then we have to find some other solution for interop between >>>> comparators. What's the performance increase from that pull request? >>>> >>>> On Wed, Aug 27, 2014 at 9:24 PM, Stephan Ewen <[hidden email]> wrote: >>>>> A lot of comparisons (and that is what makes it fast) can happen on the >>>>> binary data, without turning them into a comparable. >>>>> >>>>> The latest pull request in the String representation for example >> improved >>>>> String-keyed sorting quite a bit: >>>>> >>>>> https://github.com/apache/incubator-flink/pull/4 >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> On Wed, Aug 27, 2014 at 9:00 PM, Aljoscha Krettek <[hidden email] >>>>> wrote: >>>>> >>>>>> It would work on binary data, for example for tuples it would become: >>>>>> >>>>>> public Comparable extractKeys(DataInputView in) { >>>>>> Object[] fields = ... >>>>>> // extract only relevant fields from binary input and save in >>>>>> fields >>>>>> return Tuple(fields) // something like that >>>>>> } >>>>>> >>>>>> And for normalized keys something similar can be done. >>>>>> >>>>>> Aljoscha >>>>>> >>>>>> On Wed, Aug 27, 2014 at 8:39 PM, Stephan Ewen <[hidden email]> >> wrote: >>>>>>> The design of the comparators so far was to make them work on the >>>>>>> binary >>>>>>> data. That we need to retain, in my opinion, otherwise there is no >>>>>>> way to get good performance out of working on serialized data. >>>>>>> >>>>>>> I personally think that creating a tuple2 (key/value pair) when using >>>>>>> selector functions is actually good: >>>>>>> The key type (being treated by its dedicated comparator) benefits >> from >>>>>> all >>>>>>> the optimizations implemented for that type (bin copying, normalized >>>>>> keys, >>>>>>> ...) >>>>>>> That would be very hard to incorporate into any comparator that just >>>>>>> deserializes some comparable. >>>>>>> >>>>>>> Also, the key extractor can contain sort of heavy magic (such as to >>>>>>> block >>>>>>> keys), whatever a user put in there. If we put that into the >>>>>>> comparator, >>>>>> it >>>>>>> gets called for >>>>>>> every comparison! >>>>>>> >>>>>>> I do agree, though, that we need to come up with a better interface >>>>>>> that >>>>>>> seamlessly allows working on binary versions and on objects, without >>>>>>> duplicating too much code. >>>>>>> >>>>>>> From your suggestion, I am not sure I got everything. Could you >> post a >>>>>>> concrete example or code? >>>>>>> >>>>>>> Stephan >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Wed, Aug 27, 2014 at 5:02 PM, Aljoscha Krettek < >> [hidden email]> >>>>>>> wrote: >>>>>>> >>>>>>>> Hi Guys, >>>>>>>> while porting the Java API to Scala I'm noticing how complicated >>>>>>>> things are because of how our TypeComparators work: 1) There is only >>>>>>>> one type of comparator per TypeInformation which is created by the >>>>>>>> TypeInformation. Therefore, our KeySelectors are not actually >>>>>>>> implemented as comparators but as generated mappers that emit a >>>>>>>> Tuple2, because you wouldn't for example be able to generate a >>>>>>>> SelectorFunctionComparator for a TupleTypeInfo. (There's also a lot >>>>>>>> of magic going on with wrapping and unwrapping those tuples in >> Reduce, >>>>>>>> Join, and CoGroup.) 2) Comparators cannot really interoperate, there >>>>>>>> is special case code for the combinations that work. This will only >>>>>>>> get worse when we properly introduce POJO types, which should work >>>>>>>> well with tuple comparators and the other comparators. >>>>>>>> >>>>>>>> My proposal is this: No more TypeComparator on a per type basis. >> Just >>>>>>>> a generic comparator and PairComparator that work on Comparable. >> What >>>>>>>> used to be TypeComparators become SelectionExtractors that return a >>>>>>>> Comparable. Make Tuple comparable or add new ComparableTuple. The >>>>>>>> TupleSelectionExtractor would return a comparable tuple of the >>>>>>>> appropriate length (same for POJOs). For Tuple extractors that >> operate >>>>>>>> on only one field they would immediately return that field, without >>>>>>>> wrapping it in a tuple. This would directly support our existing >>>>>>>> KeySelector functions since the already return Comparable, when >>>>>>>> returning a tuple in a key selector function this would be >> compatible >>>>>>>> with a TupleSelectionExtractor (on the other join side, for >> example). >>>>>>>> That's my idea. What do you think? I think the current state is not >>>>>>>> maintainable, so we should do something quickly. :D >>>>>>>> >>>>>>>> Cheers, >>>>>>>> Aljoscha >>>>>>>> |
I think that the microbenchmarks are a very good start.
The serialization/deserialization is ultimately the Achilles Heel of our approach. So that one, in itself, must be as fast as possible. On Thu, Aug 28, 2014 at 10:22 AM, Chesnay Schepler < [hidden email]> wrote: > @aljoscha : yes, the tuple comparators don't work on binary data. > > > On 28.8.2014 0:10, Ufuk Celebi wrote: > >> I agree with Aljoscha that we should have further numbers on this >> (independently of what we settle on with the comparators). >> >> @Chesnay: do you feel like giving it a try or is the Python mystery eating >> up all your time? ;-) >> >> >> On Wed, Aug 27, 2014 at 10:27 PM, Aljoscha Krettek <[hidden email]> >> wrote: >> >> Ok, but that's micro benchmarks. How much time is really spent on >>> those comparisons when a real job is executed? Does anyone know? Also, >>> the tuple pair comparators never work on binary representation, right? >>> >>> On Wed, Aug 27, 2014 at 10:19 PM, Chesnay Schepler >>> <[hidden email]> wrote: >>> >>>> i cant recall definitely what the numbers were so I'll just quote myself >>>> from the PR: >>>> >>>> measurements were done using System.nanoTime() >>>> time necessary for the comparison >>>> Strings consisted of 90 characters >>>> >>>> difference in the beginning of the string >>>> New 4259 >>>> Old 23431 >>>> >>>> difference in the middle of the string >>>> New 13560 >>>> Old 30757 >>>> >>>> difference at the end of the string >>>> New 29385 >>>> Old 28293 >>>> >>>> for smaller strings (<10 i think) it was roughly 50% faster. >>>> >>>> >>>> >>>> On 27.8.2014 22:10, Aljoscha Krettek wrote: >>>> >>>>> Ok, then we have to find some other solution for interop between >>>>> comparators. What's the performance increase from that pull request? >>>>> >>>>> On Wed, Aug 27, 2014 at 9:24 PM, Stephan Ewen <[hidden email]> >>>>> wrote: >>>>> >>>>>> A lot of comparisons (and that is what makes it fast) can happen on >>>>>> the >>>>>> binary data, without turning them into a comparable. >>>>>> >>>>>> The latest pull request in the String representation for example >>>>>> >>>>> improved >>> >>>> String-keyed sorting quite a bit: >>>>>> >>>>>> https://github.com/apache/incubator-flink/pull/4 >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On Wed, Aug 27, 2014 at 9:00 PM, Aljoscha Krettek < >>>>>> [hidden email] >>>>>> wrote: >>>>>> >>>>>> It would work on binary data, for example for tuples it would become: >>>>>>> >>>>>>> public Comparable extractKeys(DataInputView in) { >>>>>>> Object[] fields = ... >>>>>>> // extract only relevant fields from binary input and save in >>>>>>> fields >>>>>>> return Tuple(fields) // something like that >>>>>>> } >>>>>>> >>>>>>> And for normalized keys something similar can be done. >>>>>>> >>>>>>> Aljoscha >>>>>>> >>>>>>> On Wed, Aug 27, 2014 at 8:39 PM, Stephan Ewen <[hidden email]> >>>>>>> >>>>>> wrote: >>> >>>> The design of the comparators so far was to make them work on the >>>>>>>> binary >>>>>>>> data. That we need to retain, in my opinion, otherwise there is no >>>>>>>> way to get good performance out of working on serialized data. >>>>>>>> >>>>>>>> I personally think that creating a tuple2 (key/value pair) when >>>>>>>> using >>>>>>>> selector functions is actually good: >>>>>>>> The key type (being treated by its dedicated comparator) benefits >>>>>>>> >>>>>>> from >>> >>>> all >>>>>>> >>>>>>>> the optimizations implemented for that type (bin copying, normalized >>>>>>>> >>>>>>> keys, >>>>>>> >>>>>>>> ...) >>>>>>>> That would be very hard to incorporate into any comparator that just >>>>>>>> deserializes some comparable. >>>>>>>> >>>>>>>> Also, the key extractor can contain sort of heavy magic (such as to >>>>>>>> block >>>>>>>> keys), whatever a user put in there. If we put that into the >>>>>>>> comparator, >>>>>>>> >>>>>>> it >>>>>>> >>>>>>>> gets called for >>>>>>>> every comparison! >>>>>>>> >>>>>>>> I do agree, though, that we need to come up with a better interface >>>>>>>> that >>>>>>>> seamlessly allows working on binary versions and on objects, without >>>>>>>> duplicating too much code. >>>>>>>> >>>>>>>> From your suggestion, I am not sure I got everything. Could you >>>>>>>> >>>>>>> post a >>> >>>> concrete example or code? >>>>>>>> >>>>>>>> Stephan >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Wed, Aug 27, 2014 at 5:02 PM, Aljoscha Krettek < >>>>>>>> >>>>>>> [hidden email]> >>> >>>> wrote: >>>>>>>> >>>>>>>> Hi Guys, >>>>>>>>> while porting the Java API to Scala I'm noticing how complicated >>>>>>>>> things are because of how our TypeComparators work: 1) There is >>>>>>>>> only >>>>>>>>> one type of comparator per TypeInformation which is created by the >>>>>>>>> TypeInformation. Therefore, our KeySelectors are not actually >>>>>>>>> implemented as comparators but as generated mappers that emit a >>>>>>>>> Tuple2, because you wouldn't for example be able to generate a >>>>>>>>> SelectorFunctionComparator for a TupleTypeInfo. (There's also a >>>>>>>>> lot >>>>>>>>> of magic going on with wrapping and unwrapping those tuples in >>>>>>>>> >>>>>>>> Reduce, >>> >>>> Join, and CoGroup.) 2) Comparators cannot really interoperate, there >>>>>>>>> is special case code for the combinations that work. This will only >>>>>>>>> get worse when we properly introduce POJO types, which should work >>>>>>>>> well with tuple comparators and the other comparators. >>>>>>>>> >>>>>>>>> My proposal is this: No more TypeComparator on a per type basis. >>>>>>>>> >>>>>>>> Just >>> >>>> a generic comparator and PairComparator that work on Comparable. >>>>>>>>> >>>>>>>> What >>> >>>> used to be TypeComparators become SelectionExtractors that return a >>>>>>>>> Comparable. Make Tuple comparable or add new ComparableTuple. The >>>>>>>>> TupleSelectionExtractor would return a comparable tuple of the >>>>>>>>> appropriate length (same for POJOs). For Tuple extractors that >>>>>>>>> >>>>>>>> operate >>> >>>> on only one field they would immediately return that field, without >>>>>>>>> wrapping it in a tuple. This would directly support our existing >>>>>>>>> KeySelector functions since the already return Comparable, when >>>>>>>>> returning a tuple in a key selector function this would be >>>>>>>>> >>>>>>>> compatible >>> >>>> with a TupleSelectionExtractor (on the other join side, for >>>>>>>>> >>>>>>>> example). >>> >>>> That's my idea. What do you think? I think the current state is not >>>>>>>>> maintainable, so we should do something quickly. :D >>>>>>>>> >>>>>>>>> Cheers, >>>>>>>>> Aljoscha >>>>>>>>> >>>>>>>>> > |
Free forum by Nabble | Edit this page |