Aggregation Design Questions

Lisonbee, Todd
Hello,

I'm working on adding Standard Deviation and others to the list of Aggregations,
https://issues.apache.org/jira/browse/FLINK-3613

Unfortunately, I didn't get very far because the general design of Aggregation on DataSets needs to change, and each solution seems to have drawbacks. For example, one easy solution would be to modify AggregateOperator to extend CustomUnaryOperation, but that seems weird because then it wouldn't be an Operator.

I wrote a design explaining some of the current limitations and background,
https://issues.apache.org/jira/secure/attachment/12794820/DataSet-Aggregation-Design-March2016-v1.txt

The design is in progress.  I wanted to check in with people before going much further.

I'd appreciate any feedback.

Thanks,

Todd

RE: Aggregation Design Questions

Lisonbee, Todd
I wrote another design for a summarize() function on DataSet.
https://issues.apache.org/jira/browse/FLINK-3664

I think this would be a better place for me to start than working on generic Aggregations. (I could move ahead with it immediately, and there are no tricky decisions, if people more or less like the design.)

Any support for a summarize() function?

        // Summarize a DataSet of Tuples by collecting single-pass statistics for all columns
        // example usage:

        DataSet<Tuple3<Double, String, Boolean>> input = // [...]
        Tuple3<DoubleColumnSummary, StringColumnSummary, BooleanColumnSummary> summary = input.summarize();
        summary.f0.stddev();          // standard deviation of the Double column
        summary.f1.maxStringLength(); // length of the longest value in the String column
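For illustration only, here is a minimal sketch of how a "single pass" standard deviation could be accumulated, assuming Welford's running update plus a pairwise merge of partial results (which is what would let each partition be summarized independently and then combined). The class name RunningStats and all of its methods are hypothetical; none of this is taken from Flink or from the linked design.

```java
/**
 * Hypothetical helper, NOT part of Flink or of the proposed design:
 * accumulates count, mean and sum of squared deviations in one pass.
 */
class RunningStats {
    private long count = 0;
    private double mean = 0.0;
    private double m2 = 0.0; // sum of squared deviations from the running mean

    /** Welford's update: fold one value into the running statistics. */
    void add(double value) {
        count++;
        double delta = value - mean;
        mean += delta / count;
        m2 += delta * (value - mean);
    }

    /** Merge a partial result computed elsewhere (e.g. on another partition). */
    void merge(RunningStats other) {
        if (other.count == 0) {
            return;
        }
        if (count == 0) {
            count = other.count;
            mean = other.mean;
            m2 = other.m2;
            return;
        }
        long total = count + other.count;
        double delta = other.mean - mean;
        // Pairwise combination of the two sums of squared deviations.
        m2 += other.m2 + delta * delta * ((double) count * other.count / total);
        mean += delta * other.count / total;
        count = total;
    }

    long count() {
        return count;
    }

    double mean() {
        return mean;
    }

    /** Sample variance (n - 1 denominator); NaN for fewer than two values. */
    double variance() {
        return count > 1 ? m2 / (count - 1) : Double.NaN;
    }

    /** Sample standard deviation. */
    double stddev() {
        return Math.sqrt(variance());
    }
}
```

The merge step is the part that matters for a distributed setting: each partition can fill its own RunningStats with add(), and the partial results can then be combined with merge() in any order without revisiting the data.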

Thanks.


-----Original Message-----
From: Lisonbee, Todd [mailto:[hidden email]]
Sent: Wednesday, March 23, 2016 9:46 AM
To: [hidden email]
Subject: Aggregation Design Questions

[...]

Re: Aggregation Design Questions

Stephan Ewen
Hi!

Sorry for taking so long to react. I am looking through this now as well...

Stephan


On Wed, Mar 23, 2016 at 6:59 PM, Lisonbee, Todd <[hidden email]>
wrote:

> I wrote another design for a summarize() function on DataSet.
> https://issues.apache.org/jira/browse/FLINK-3664
> [...]