Aggregations


Aggregations

Gyula Fóra
Hey,

As we were implementing the aggregation operators, we found that the
working logic of the min and max aggregation in the batch API seems a
little strange.

So let's assume that the user only wants to make one aggregation at a time.
Wouldn't it make more sense to return the element of the dataset which has
the minimal value (or the first one having it), instead of creating a new
element with the minimum value in that field and the other fields taken from
the last data element?

For the sum aggregation this makes sense, but shouldn't min and max
actually return an element of the dataset?

(well of course if you use the .and operator this gets more tricky)
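To make the difference concrete, here is a plain-Java sketch (not actual Flink code; the method names are made up for illustration) of the two behaviors on a list of (id, value) pairs:

```java
import java.util.Arrays;
import java.util.List;

public class MinSemantics {
    // Current behavior: builds a new element whose aggregated field is the
    // minimum, while the other field is taken from the last element seen
    static int[] currentMin(List<int[]> data, int field) {
        int min = Integer.MAX_VALUE;
        int[] last = null;
        for (int[] t : data) {
            min = Math.min(min, t[field]);
            last = t;
        }
        int[] result = last.clone();
        result[field] = min;
        return result;
    }

    // Proposed behavior: returns the (first) element that actually holds the minimum
    static int[] minBy(List<int[]> data, int field) {
        int[] best = null;
        for (int[] t : data) {
            if (best == null || t[field] < best[field]) {
                best = t;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<int[]> data = Arrays.asList(
            new int[]{1, 5}, new int[]{2, 3}, new int[]{3, 7});
        // minimum value, but the id of the last element
        System.out.println(Arrays.toString(currentMin(data, 1))); // [3, 3]
        // the element that actually has the minimum
        System.out.println(Arrays.toString(minBy(data, 1)));      // [2, 3]
    }
}
```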

Cheers,
Gyula

Re: Aggregations

Ufuk Celebi-2
On Fri, Sep 5, 2014 at 10:30 PM, Gyula Fóra <[hidden email]> wrote:

> For the sum aggregation this makes sense, but shouldn't min and max
> actually return an element of the dataset?
>

There are also the minBy and maxBy methods, which return the Tuple with the
minimum/maximum value, whereas the min and max methods just work on the
field.

I also have the feeling that this might be unintuitive and that users would
expect minBy/maxBy semantics to be the default.

Re: Aggregations

Fabian Hueske
I don't like the semantics of the current aggregation operator either.
I'd be happy to discuss whether and how we should change it.

Some time ago, I sketched an alternative in the old Stratosphere GitHub wiki,
which might be a good starting point for a discussion:

https://github.com/stratosphere/stratosphere/wiki/Design-of-Aggregate-Operator

Cheers, Fabian




Re: Aggregations

Stephan Ewen
I agree, the aggregations were a quick shot and should be reworked. They
are a bit inspired by the SQL style aggregations, so MIN and MAX give the
minimum and maximum value of the column.

MinBy and MaxBy are not aggregations, they are rather "selectors", which
grab the tuple with that characteristic. At least in SQL terms...




Re: Aggregations

Hermann Gábor
I also agree on using the minBy as the default mechanism.

If both min and minBy are needed, it would seem more natural for min (and
also for sum) to return only the given field of the tuple in my opinion.

More generally, a reduce function with a custom return type would also be
useful in my view. In that case the user would also give an initial value of
type T to begin the reduction with, and implement a function which reduces an
element and a value of type T into a new value of type T. Would that make sense?
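What is described here is essentially a left fold. A generic plain-Java sketch (the names are illustrative, not a concrete API proposal) could look like:

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BiFunction;

public class Fold {
    // foldLeft: starts from an initial value of type T and folds each element
    // of type E into it, producing a final value of type T
    static <E, T> T foldLeft(List<E> elements, T initial, BiFunction<T, E, T> f) {
        T acc = initial;
        for (E e : elements) {
            acc = f.apply(acc, e);
        }
        return acc;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "a");
        // count the elements: reduces Strings into an Integer result
        int count = foldLeft(words, 0, (acc, w) -> acc + 1);
        System.out.println(count); // 3
    }
}
```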

Re: Aggregations

Fabian Hueske
Having aggregation functions only return a single value is not very
helpful IMO.
First, an aggregation function should also work on grouped data sets, i.e.,
return one aggregate for each group. Hence, the grouping keys must be
included in the result somehow.
Second, imagine a use case where the min, max, and avg value of some fields
of a tuple are needed. If this were computed with multiple independent
aggregation functions, the data set would be shuffled and reduced three
times and possibly joined again.

I think it should be possible to combine multiple aggregation functions,
e.g., compute a result with field 2 as grouping key, the minimum and
maximum of field 3 and the average of field 5.
Basically, have something like the project operator but with aggregation
functions and keys. This is also what I sketched in my proposal.
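A one-pass version of that example can be sketched in plain Java (the Agg class is made up for illustration, and the field indices are simplified to 0/1/2): one shuffle by the key, with all aggregates maintained in a single accumulator per group.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MultiAgg {
    // Per-group accumulator holding min, max, sum and count in one pass
    static class Agg {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        double sum = 0;
        long count = 0;

        void add(double minMaxField, double avgField) {
            min = Math.min(min, minMaxField);
            max = Math.max(max, minMaxField);
            sum += avgField;
            count++;
        }

        double avg() {
            return sum / count;
        }
    }

    // Group by field 0, min/max of field 1 and avg of field 2, in a single pass
    static Map<String, Agg> aggregate(List<Object[]> tuples) {
        Map<String, Agg> groups = new HashMap<>();
        for (Object[] t : tuples) {
            groups.computeIfAbsent((String) t[0], k -> new Agg())
                  .add(((Number) t[1]).doubleValue(), ((Number) t[2]).doubleValue());
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Object[]> tuples = Arrays.asList(
            new Object[]{"a", 3, 10},
            new Object[]{"a", 1, 20},
            new Object[]{"b", 5, 30});
        Agg a = aggregate(tuples).get("a");
        System.out.println(a.min + " " + a.max + " " + a.avg()); // 1.0 3.0 15.0
    }
}
```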

@Hermann: Regarding the reduce function with custom return type, do you
have some concrete use case in mind for that?

Cheers, Fabian


Re: Aggregations

Stephan Ewen
Let's come up with a comprehensive design that works for both Batch and
Streaming API.

It would be good to include aggregation functions that internally break
down into multiple aggregations (like AVG breaking down into a count and a
sum), while making sure that no aggregate is computed twice unnecessarily.
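The sharing can be sketched as follows (plain Java, hypothetical names): AVG, SUM and COUNT over the same field all read from one shared (sum, count) accumulator, so requesting AVG alongside SUM and COUNT adds no extra computation.

```java
public class SharedAgg {
    // One accumulator per field: AVG = sum / count reuses the same state
    // that SUM and COUNT already maintain, so nothing is computed twice
    static class SumCount {
        double sum = 0;
        long count = 0;

        void add(double value) {
            sum += value;
            count++;
        }

        double sum()  { return sum; }
        long count()  { return count; }
        double avg()  { return sum / count; }
    }

    public static void main(String[] args) {
        SumCount acc = new SumCount();
        for (double v : new double[]{1, 2, 3, 4}) {
            acc.add(v);
        }
        System.out.println(acc.sum() + " " + acc.count() + " " + acc.avg()); // 10.0 4 2.5
    }
}
```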


Re: Aggregations

Hermann Gábor
In reply to this post by Fabian Hueske
The only advantage of returning a single value instead of the whole tuple
would be having smaller data. I agree, it is not that useful, and the logic
that you proposed earlier could simply provide this with a single
aggregation.

In addition, isn't it possible to provide the mechanism in your proposal
without the user needing to set the return types? Can the types be extracted
from the tuple and the aggregation (e.g. average should be a Double)?

A simple example of the custom return type reduce function is a modified
WordCount:

        public class WC {
                public String word;
                public int count;
                // [...]
        }

        public class WordCounter implements ReduceFunction<String, WC> {

                @Override
                public WC reduce(String word, WC reductionValue) {
                        return new WC(word, 1 + reductionValue.count);
                }
        }

        groupedWords.reduce(new WordCounter(), new WC(null, 0));


(Of course this can be easily done with an aggregation, but this was
the simplest use case I could come up with.)

The only advantage here is also the smaller/clearer value, and maybe
generality. Functional languages like Haskell support this kind of reduction
on collections (that is the reason I thought about this). On the other hand,
there are many drawbacks to a reduce function like this (it cannot combine
two separately reduced sets of data, the user must provide an initial value,
and every reduction like this can be done with larger tuples). It is not
clear to me whether it would be better or not, but I thought it's worth
consideration.
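The "cannot combine two separately reduced sets of data" drawback follows from the asymmetric signature: a classic reduce of type (E, E) -> E can merge partial results from different partitions with the same function, while a fold of type (E, T) -> T gives no way to merge two partial T values without an extra combine function. A small plain-Java illustration (names made up):

```java
import java.util.Arrays;
import java.util.List;

public class Combinable {
    // Classic reduce: (E, E) -> E, same type in and out, so partition
    // results are themselves elements and can be merged further
    static int reduceSum(List<Integer> part) {
        int acc = 0;
        for (int v : part) {
            acc += v;
        }
        return acc;
    }

    public static void main(String[] args) {
        List<Integer> p1 = Arrays.asList(1, 2);
        List<Integer> p2 = Arrays.asList(3, 4);
        // The partial sums combine with the very same function
        int combined = reduceSum(Arrays.asList(reduceSum(p1), reduceSum(p2)));
        System.out.println(combined); // 10
        // With an (E, T) -> T fold, the two partial T values could not be
        // merged without a separate combine(T, T) -> T function
    }
}
```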

Cheers,
Gabor




Re: Aggregations

Fabian Hueske
Yes, I guess we could infer the types from the input or the used
aggregation function (Avg -> Double, Cnt -> Long).
We also thought about removing the explicit types from the project()
operator (FLINK-1040).

I see, the custom type reduce might be useful.
However, we should be careful not to bloat the API too much.
Not sure if it is useful/important enough.

Opinions?
