Hey,
As we were implementing the aggregation operators, we found that the logic of the min and max aggregations in the batch API seems a little strange. Assuming the user only wants to make one aggregation at a time, wouldn't it make more sense to return the element of the dataset which has the minimal value (or the first one having it), instead of creating a new element with the minimum value in that field and the other fields taken from the last data element?

For the sum aggregation this makes sense, but shouldn't min and max actually return an element of the dataset? (Of course, if you use the .and operator this gets more tricky.)

Cheers,
Gyula
On Fri, Sep 5, 2014 at 10:30 PM, Gyula Fóra <[hidden email]> wrote:
> For the sum aggregation this makes sense, but shouldn't min and max
> actually return an element of the dataset?

There are also the minBy and maxBy methods, which return the tuple with the minimum/maximum value, whereas the min and max methods just work on the field. I also have the feeling that this might be unintuitive and that users would expect minBy/maxBy semantics to be the default.
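To make the difference concrete, here is a minimal sketch of the two behaviors (assuming the Java DataSet API with the minBy method mentioned above; the data and field indices are made up for illustration):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

DataSet<Tuple2<String, Integer>> data = env.fromElements(
        new Tuple2<String, Integer>("a", 3),
        new Tuple2<String, Integer>("b", 1),
        new Tuple2<String, Integer>("c", 2));

// min(1) aggregates only field 1; the remaining fields of the result are
// taken from another element (per this thread, the last one seen), so the
// result may be a tuple that never occurred in the input, e.g. ("c", 1).
DataSet<Tuple2<String, Integer>> minField = data.min(1);

// minBy(1) selects the element that actually carries the minimum value
// in field 1, here ("b", 1).
DataSet<Tuple2<String, Integer>> minElement = data.minBy(1);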
I don't like the semantics of the current aggregation operator either.
I'd be happy to discuss whether and how we should change it. Some time ago, I sketched an alternative in the old Stratosphere GitHub wiki, which might be a good starting point for a discussion:

https://github.com/stratosphere/stratosphere/wiki/Design-of-Aggregate-Operator

Cheers,
Fabian
I agree, the aggregations were a quick shot and should be reworked. They
are a bit inspired by SQL-style aggregations, so MIN and MAX give the minimum and maximum value of the column. MinBy and MaxBy are not aggregations; they are rather "selectors", which grab the tuple with that characteristic. At least in SQL terms...
I also agree on using minBy as the default mechanism.

If both min and minBy are needed, it would seem more natural to me for min (and also for sum) to return only the given field of the tuple.

More generally, a reduce function with a custom return type would also be useful in my view. In that case the user would give a value of type T to begin the reduction with, and implement a function which takes an input value and a value of type T and returns a value of type T. Would that make sense?
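A rough sketch of what such an interface could look like (hypothetical names and signatures, not an existing API):

// Hypothetical fold-style reduce: I is the input element type, T the
// custom result type. The user supplies an initial value of type T and
// this function, which folds one input element into the running value.
public interface FoldFunction<I, T> {
    T fold(I value, T current);
}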
Having aggregation functions only return a single value is not very helpful, IMO.

First, an aggregation function should also work on grouped data sets, i.e., return one aggregate for each group. Hence, the grouping keys must be included in the result somehow.

Second, imagine a use case where the min, max, and avg value of some fields of a tuple are needed. If this were computed with multiple independent aggregation functions, the data set would be shuffled and reduced three times and possibly joined again.

I think it should be possible to combine multiple aggregation functions, e.g., compute a result with field 2 as grouping key, the minimum and maximum of field 3, and the average of field 5. Basically, have something like the project operator but with aggregation functions and keys. This is also what I sketched in my proposal.

@Hermann: Regarding the reduce function with custom return type, do you have some concrete use case in mind for that?

Cheers, Fabian
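For illustration, a combined aggregation along these lines could look roughly like this (a hypothetical API in the spirit of the linked proposal; key, min, max, and avg are made-up helper functions):

// Group on field 2, then compute several aggregates in a single pass:
// one shuffle, one reduce, one result tuple per group.
input.groupBy(2)
     .aggregate(key(2), min(3), max(3), avg(5));

The point is that all the output fields come out of a single shuffle and reduce, instead of three independent aggregations that would have to be joined afterwards.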
Let's come up with a comprehensive design that works for both the batch and streaming APIs. It would be good to include aggregation functions that internally break down into multiple aggregations (like AVG breaking down into a count and a sum), making sure that no aggregate is computed twice unnecessarily.
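One way to achieve that is to let each composite aggregate declare the primitive aggregates it is computed from, so a planner can deduplicate them before execution. A minimal sketch (all names here are hypothetical, not existing Flink code):

import java.util.*;

// Sketch: AVG(5) requires SUM(5) and COUNT; if the user also asks for
// SUM(5) explicitly, the planner computes it once and both share it.
interface Aggregate {
    Set<String> requiredPrimitives();               // e.g. {"SUM(5)", "COUNT"}
    double finish(Map<String, Double> primitives);  // e.g. sum / count
}

class Avg implements Aggregate {
    private final int field;
    Avg(int field) { this.field = field; }

    public Set<String> requiredPrimitives() {
        return new HashSet<String>(Arrays.asList("SUM(" + field + ")", "COUNT"));
    }

    public double finish(Map<String, Double> p) {
        return p.get("SUM(" + field + ")") / p.get("COUNT");
    }
}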
The only advantage of returning a single value instead of the whole tuple would be having smaller data. I agree, it is not that useful, and the logic that you proposed earlier could simply provide this with a single aggregation.

In addition, isn't it possible to provide the mechanism in your proposal without the user needing to set the return types? Can the types be extracted from the tuple and the aggregation (e.g. an average should be a Double)?

A simple example of the custom return type reduce function is a modified WordCount:

public class WC {
    public String word;
    public int count;
    // [...]
}

public class WordCounter implements ReduceFunction<String, WC> {

    @Override
    public WC reduce(String word, WC reductionValue) {
        return new WC(word, 1 + reductionValue.count);
    }
}

groupedWords.reduce(new WordCounter(), new WC(null, 0));

(Of course this can be easily done with an aggregation, but this was the simplest use case I could come up with.)

The only advantage here is also the smaller/clearer value and maybe generality. Functional languages like Haskell support this kind of reduction on collections (that is the reason I thought about this). On the other hand, there are many drawbacks to a reduce function like this: it cannot combine two separately reduced sets of data, the user must provide an initial value, and every reduction like this can be done with larger tuples. It is not clear to me whether it would be better or not, but I thought it was worth consideration.

Cheers,
Gabor
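The first drawback mentioned here could, in principle, be addressed by requiring one more method, at the cost of extra user code. A hypothetical sketch, extending the fold interface sketched earlier in the thread:

// Adding a combine step makes the fold parallelizable: partial results
// produced on different partitions can be merged into one.
public interface CombinableFoldFunction<I, T> {
    T fold(I value, T current);   // add one element to a partial result
    T combine(T left, T right);   // merge two partial results
}

For the WordCount example, combine would simply add the two counts.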
Yes, I guess we could infer the types from the input or the used
aggregation function (Avg -> Double, Cnt -> Long). We also thought about removing the explicit types from the project() operator (FLINK-1040).

I see, the custom type reduce might be useful. However, we should be careful not to bloat the API too much. Not sure if it is useful/important enough. Opinions?
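Such inference could be as small as each aggregation function reporting its result type (a hypothetical sketch, not an existing Flink interface):

// Sketch: the output tuple type is derived from per-field result types,
// so the user never spells them out. E.g. MIN/MAX keep the input field's
// type, AVG reports Double, COUNT reports Long. Names are made up.
interface AggregationSpec {
    Class<?> resultType(Class<?> inputFieldType);
}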