[DISCUSS] proposal for the User Defined AGGregate (UDAGG)

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

[DISCUSS] proposal for the User Defined AGGregate (UDAGG)

Shaoxuan Wang
Hello everyone,

I am writing this email to propose a new User Defined Aggregate interface.
We were trying to leverage the existing Aggregate interface, but
unfortunately we realized that it is not sufficient to meet all our needs.
Here are the obstacles we have observed:
1) The current aggregate interface is not very concise to users. One needs
to know the design details of the intermediate Row buffer before implements
an Aggregate. Seven functions are needed even for a simple Count aggregate.
We'd better to make the UDAGG interface much more concisely.
2) the current aggregate function can be only applied on one single column.
There are many scenarios which require the aggregate function taking
multiple columns as the inputs.
3) “Retraction” is not covered in the current Aggregate.

For #1, I am thinking instead of letting users to manipulate the
intermediate buffer, we could potentially put the entire Aggregate instance
or a subclass instance of Aggregate to the Row buffer, such that the user
does not need to know how the Aggregate state is maintained by the
framework.
But to achieve this goal, we probably need a new dataStream API. The
existing reduce API does not work with two different types of inputs (in
this proposal, the inputs will be upstream values, and the instance of the
current accumulated Aggregate), while the fold API is not able to merge the
two Aggregate results (which is usually needed for merging two session
windows).

For #3, besides the aggregate itself, there are a few other things need to
be taken care of to fully support the retractions. I will share a separate
concrete proposal about how to generate and process retractions, and how it
works along with this new proposed UDAGG.

I would like really appreciate if you can share your opinions on this
proposal, especially for the needed dataStream API for #1. Also, if there
is any other good things you think to be better added for UDAGG, please
feel free to share with us. I will draft my proposal in a google doc and
share to the flink DEV group very soon.

Thanks,
Shaoxuan
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] proposal for the User Defined AGGregate (UDAGG)

Fabian Hueske-2
Hi Shaoxuan,

user-defined aggregates would be a great addition to the Table API / SQL.
I completely agree that the current (internal) interface is not well suited
as an external interface and needs to be redesigned if exposed to users.

We need to careful think about this new interface and how we can integrate
it with the DataStream (and DataSet) API to support all required
operations, esp. with respect to null aggregates and support for combining
/ merging.
I agree that for efficient execution, we should avoid WindowFunctions
(large state) and FoldFunction (not mergeable). If we need a new interface
in the DataStream API, we need to discuss this in more detail.
I think we need a bit more information about the proposed UDAGG interface
to discuss how this can be mapped to DataStream operators.

Support for retraction will be required for our future plans with the
streaming Table API / SQL interface.

Looking forward to your proposal,
Fabian

2017-01-10 15:40 GMT+01:00 Shaoxuan Wang <[hidden email]>:

> Hello everyone,
>
> I am writing this email to propose a new User Defined Aggregate interface.
> We were trying to leverage the existing Aggregate interface, but
> unfortunately we realized that it is not sufficient to meet all our needs.
> Here are the obstacles we have observed:
> 1) The current aggregate interface is not very concise to users. One needs
> to know the design details of the intermediate Row buffer before implements
> an Aggregate. Seven functions are needed even for a simple Count aggregate.
> We'd better to make the UDAGG interface much more concisely.
> 2) the current aggregate function can be only applied on one single column.
> There are many scenarios which require the aggregate function taking
> multiple columns as the inputs.
> 3) “Retraction” is not covered in the current Aggregate.
>
> For #1, I am thinking instead of letting users to manipulate the
> intermediate buffer, we could potentially put the entire Aggregate instance
> or a subclass instance of Aggregate to the Row buffer, such that the user
> does not need to know how the Aggregate state is maintained by the
> framework.
> But to achieve this goal, we probably need a new dataStream API. The
> existing reduce API does not work with two different types of inputs (in
> this proposal, the inputs will be upstream values, and the instance of the
> current accumulated Aggregate), while the fold API is not able to merge the
> two Aggregate results (which is usually needed for merging two session
> windows).
>
> For #3, besides the aggregate itself, there are a few other things need to
> be taken care of to fully support the retractions. I will share a separate
> concrete proposal about how to generate and process retractions, and how it
> works along with this new proposed UDAGG.
>
> I would like really appreciate if you can share your opinions on this
> proposal, especially for the needed dataStream API for #1. Also, if there
> is any other good things you think to be better added for UDAGG, please
> feel free to share with us. I will draft my proposal in a google doc and
> share to the flink DEV group very soon.
>
> Thanks,
> Shaoxuan
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] proposal for the User Defined AGGregate (UDAGG)

Shaoxuan Wang
Hi everyone,
I have drafted the design doc (link is provided below) for UDAGG, and
created the JIRA (FLINK-5564) to track the progress of this design.
Special thanks to Stephan and Fabian for their advice and help.

Please check the design doc, feel free to share your comments in the google
doc:
https://docs.google.com/document/d/19JXK8jLIi8IqV9yf7hOs_Oz67yXOypY7Uh5gIOK2r-U/edit

Regards,
Shaoxuan

On Wed, Jan 11, 2017 at 6:09 AM, Fabian Hueske <[hidden email]> wrote:

> Hi Shaoxuan,
>
> user-defined aggregates would be a great addition to the Table API / SQL.
> I completely agree that the current (internal) interface is not well suited
> as an external interface and needs to be redesigned if exposed to users.
>
> We need to careful think about this new interface and how we can integrate
> it with the DataStream (and DataSet) API to support all required
> operations, esp. with respect to null aggregates and support for combining
> / merging.
> I agree that for efficient execution, we should avoid WindowFunctions
> (large state) and FoldFunction (not mergeable). If we need a new interface
> in the DataStream API, we need to discuss this in more detail.
> I think we need a bit more information about the proposed UDAGG interface
> to discuss how this can be mapped to DataStream operators.
>
> Support for retraction will be required for our future plans with the
> streaming Table API / SQL interface.
>
> Looking forward to your proposal,
> Fabian
>
> 2017-01-10 15:40 GMT+01:00 Shaoxuan Wang <[hidden email]>:
>
> > Hello everyone,
> >
> > I am writing this email to propose a new User Defined Aggregate
> interface.
> > We were trying to leverage the existing Aggregate interface, but
> > unfortunately we realized that it is not sufficient to meet all our
> needs.
> > Here are the obstacles we have observed:
> > 1) The current aggregate interface is not very concise to users. One
> needs
> > to know the design details of the intermediate Row buffer before
> implements
> > an Aggregate. Seven functions are needed even for a simple Count
> aggregate.
> > We'd better to make the UDAGG interface much more concisely.
> > 2) the current aggregate function can be only applied on one single
> column.
> > There are many scenarios which require the aggregate function taking
> > multiple columns as the inputs.
> > 3) “Retraction” is not covered in the current Aggregate.
> >
> > For #1, I am thinking instead of letting users to manipulate the
> > intermediate buffer, we could potentially put the entire Aggregate
> instance
> > or a subclass instance of Aggregate to the Row buffer, such that the user
> > does not need to know how the Aggregate state is maintained by the
> > framework.
> > But to achieve this goal, we probably need a new dataStream API. The
> > existing reduce API does not work with two different types of inputs (in
> > this proposal, the inputs will be upstream values, and the instance of
> the
> > current accumulated Aggregate), while the fold API is not able to merge
> the
> > two Aggregate results (which is usually needed for merging two session
> > windows).
> >
> > For #3, besides the aggregate itself, there are a few other things need
> to
> > be taken care of to fully support the retractions. I will share a
> separate
> > concrete proposal about how to generate and process retractions, and how
> it
> > works along with this new proposed UDAGG.
> >
> > I would like really appreciate if you can share your opinions on this
> > proposal, especially for the needed dataStream API for #1. Also, if there
> > is any other good things you think to be better added for UDAGG, please
> > feel free to share with us. I will draft my proposal in a google doc and
> > share to the flink DEV group very soon.
> >
> > Thanks,
> > Shaoxuan
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] proposal for the User Defined AGGregate (UDAGG)

Fabian Hueske-2
Hi Shaoxuan,

thanks a lot for this great design doc.
I think user defined aggregation functions are a very important feature for
the Table API and SQL.

Have you thought about how the aggregation functions will be embedded in
Flink functions?
At the moment, we have a generic Flink function which is configured with
aggregation functions, i.e., we do not leverage code generation here.
Do you plan to embed built-in and user-defined aggregations functions that
implement the proposed API with code generation?

Can you maybe extend the JIRA or design document with this information?

Thank you,
Fabian

2017-01-18 20:55 GMT+01:00 Shaoxuan Wang <[hidden email]>:

> Hi everyone,
> I have drafted the design doc (link is provided below) for UDAGG, and
> created the JIRA (FLINK-5564) to track the progress of this design.
> Special thanks to Stephan and Fabian for their advice and help.
>
> Please check the design doc, feel free to share your comments in the google
> doc:
> https://docs.google.com/document/d/19JXK8jLIi8IqV9yf7hOs_Oz6
> 7yXOypY7Uh5gIOK2r-U/edit
>
> Regards,
> Shaoxuan
>
> On Wed, Jan 11, 2017 at 6:09 AM, Fabian Hueske <[hidden email]> wrote:
>
> > Hi Shaoxuan,
> >
> > user-defined aggregates would be a great addition to the Table API / SQL.
> > I completely agree that the current (internal) interface is not well
> suited
> > as an external interface and needs to be redesigned if exposed to users.
> >
> > We need to careful think about this new interface and how we can
> integrate
> > it with the DataStream (and DataSet) API to support all required
> > operations, esp. with respect to null aggregates and support for
> combining
> > / merging.
> > I agree that for efficient execution, we should avoid WindowFunctions
> > (large state) and FoldFunction (not mergeable). If we need a new
> interface
> > in the DataStream API, we need to discuss this in more detail.
> > I think we need a bit more information about the proposed UDAGG interface
> > to discuss how this can be mapped to DataStream operators.
> >
> > Support for retraction will be required for our future plans with the
> > streaming Table API / SQL interface.
> >
> > Looking forward to your proposal,
> > Fabian
> >
> > 2017-01-10 15:40 GMT+01:00 Shaoxuan Wang <[hidden email]>:
> >
> > > Hello everyone,
> > >
> > > I am writing this email to propose a new User Defined Aggregate
> > interface.
> > > We were trying to leverage the existing Aggregate interface, but
> > > unfortunately we realized that it is not sufficient to meet all our
> > needs.
> > > Here are the obstacles we have observed:
> > > 1) The current aggregate interface is not very concise to users. One
> > needs
> > > to know the design details of the intermediate Row buffer before
> > implements
> > > an Aggregate. Seven functions are needed even for a simple Count
> > aggregate.
> > > We'd better to make the UDAGG interface much more concisely.
> > > 2) the current aggregate function can be only applied on one single
> > column.
> > > There are many scenarios which require the aggregate function taking
> > > multiple columns as the inputs.
> > > 3) “Retraction” is not covered in the current Aggregate.
> > >
> > > For #1, I am thinking instead of letting users to manipulate the
> > > intermediate buffer, we could potentially put the entire Aggregate
> > instance
> > > or a subclass instance of Aggregate to the Row buffer, such that the
> user
> > > does not need to know how the Aggregate state is maintained by the
> > > framework.
> > > But to achieve this goal, we probably need a new dataStream API. The
> > > existing reduce API does not work with two different types of inputs
> (in
> > > this proposal, the inputs will be upstream values, and the instance of
> > the
> > > current accumulated Aggregate), while the fold API is not able to merge
> > the
> > > two Aggregate results (which is usually needed for merging two session
> > > windows).
> > >
> > > For #3, besides the aggregate itself, there are a few other things need
> > to
> > > be taken care of to fully support the retractions. I will share a
> > separate
> > > concrete proposal about how to generate and process retractions, and
> how
> > it
> > > works along with this new proposed UDAGG.
> > >
> > > I would like really appreciate if you can share your opinions on this
> > > proposal, especially for the needed dataStream API for #1. Also, if
> there
> > > is any other good things you think to be better added for UDAGG, please
> > > feel free to share with us. I will draft my proposal in a google doc
> and
> > > share to the flink DEV group very soon.
> > >
> > > Thanks,
> > > Shaoxuan
> > >
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] proposal for the User Defined AGGregate (UDAGG)

Shaoxuan Wang
Hi Fabian,
Thanks for the carefully checking on the proposal.
Yes, code generation is in my plan. As shown in "2.3 UDAGG interface", the
input and return types of the new proposed UDAGG functions are dynamically
given by the users ("[user defined xxx inputs/types]"). All embed built-in
functions for this new API have to be generated via codegen. I will update
Jira and doc.

Thanks,
Shaoxuan


On Sat, Jan 21, 2017 at 7:29 AM, Fabian Hueske <[hidden email]> wrote:

> Hi Shaoxuan,
>
> thanks a lot for this great design doc.
> I think user defined aggregation functions are a very important feature for
> the Table API and SQL.
>
> Have you thought about how the aggregation functions will be embedded in
> Flink functions?
> At the moment, we have a generic Flink function which is configured with
> aggregation functions, i.e., we do not leverage code generation here.
> Do you plan to embed built-in and user-defined aggregations functions that
> implement the proposed API with code generation?
>
> Can you maybe extend the JIRA or design document with this information?
>
> Thank you,
> Fabian
>
> 2017-01-18 20:55 GMT+01:00 Shaoxuan Wang <[hidden email]>:
>
> > Hi everyone,
> > I have drafted the design doc (link is provided below) for UDAGG, and
> > created the JIRA (FLINK-5564) to track the progress of this design.
> > Special thanks to Stephan and Fabian for their advice and help.
> >
> > Please check the design doc, feel free to share your comments in the
> google
> > doc:
> > https://docs.google.com/document/d/19JXK8jLIi8IqV9yf7hOs_Oz6
> > 7yXOypY7Uh5gIOK2r-U/edit
> >
> > Regards,
> > Shaoxuan
> >
> > On Wed, Jan 11, 2017 at 6:09 AM, Fabian Hueske <[hidden email]>
> wrote:
> >
> > > Hi Shaoxuan,
> > >
> > > user-defined aggregates would be a great addition to the Table API /
> SQL.
> > > I completely agree that the current (internal) interface is not well
> > suited
> > > as an external interface and needs to be redesigned if exposed to
> users.
> > >
> > > We need to careful think about this new interface and how we can
> > integrate
> > > it with the DataStream (and DataSet) API to support all required
> > > operations, esp. with respect to null aggregates and support for
> > combining
> > > / merging.
> > > I agree that for efficient execution, we should avoid WindowFunctions
> > > (large state) and FoldFunction (not mergeable). If we need a new
> > interface
> > > in the DataStream API, we need to discuss this in more detail.
> > > I think we need a bit more information about the proposed UDAGG
> interface
> > > to discuss how this can be mapped to DataStream operators.
> > >
> > > Support for retraction will be required for our future plans with the
> > > streaming Table API / SQL interface.
> > >
> > > Looking forward to your proposal,
> > > Fabian
> > >
> > > 2017-01-10 15:40 GMT+01:00 Shaoxuan Wang <[hidden email]>:
> > >
> > > > Hello everyone,
> > > >
> > > > I am writing this email to propose a new User Defined Aggregate
> > > interface.
> > > > We were trying to leverage the existing Aggregate interface, but
> > > > unfortunately we realized that it is not sufficient to meet all our
> > > needs.
> > > > Here are the obstacles we have observed:
> > > > 1) The current aggregate interface is not very concise to users. One
> > > needs
> > > > to know the design details of the intermediate Row buffer before
> > > implements
> > > > an Aggregate. Seven functions are needed even for a simple Count
> > > aggregate.
> > > > We'd better to make the UDAGG interface much more concisely.
> > > > 2) the current aggregate function can be only applied on one single
> > > column.
> > > > There are many scenarios which require the aggregate function taking
> > > > multiple columns as the inputs.
> > > > 3) “Retraction” is not covered in the current Aggregate.
> > > >
> > > > For #1, I am thinking instead of letting users to manipulate the
> > > > intermediate buffer, we could potentially put the entire Aggregate
> > > instance
> > > > or a subclass instance of Aggregate to the Row buffer, such that the
> > user
> > > > does not need to know how the Aggregate state is maintained by the
> > > > framework.
> > > > But to achieve this goal, we probably need a new dataStream API. The
> > > > existing reduce API does not work with two different types of inputs
> > (in
> > > > this proposal, the inputs will be upstream values, and the instance
> of
> > > the
> > > > current accumulated Aggregate), while the fold API is not able to
> merge
> > > the
> > > > two Aggregate results (which is usually needed for merging two
> session
> > > > windows).
> > > >
> > > > For #3, besides the aggregate itself, there are a few other things
> need
> > > to
> > > > be taken care of to fully support the retractions. I will share a
> > > separate
> > > > concrete proposal about how to generate and process retractions, and
> > how
> > > it
> > > > works along with this new proposed UDAGG.
> > > >
> > > > I would like really appreciate if you can share your opinions on this
> > > > proposal, especially for the needed dataStream API for #1. Also, if
> > there
> > > > is any other good things you think to be better added for UDAGG,
> please
> > > > feel free to share with us. I will draft my proposal in a google doc
> > and
> > > > share to the flink DEV group very soon.
> > > >
> > > > Thanks,
> > > > Shaoxuan
> > > >
> > >
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] proposal for the User Defined AGGregate (UDAGG)

Fabian Hueske-2
Thanks for the clarification Shaoxuan.

Cheers, Fabian

2017-01-22 4:08 GMT+01:00 Shaoxuan Wang <[hidden email]>:

> Hi Fabian,
> Thanks for the carefully checking on the proposal.
> Yes, code generation is in my plan. As shown in "2.3 UDAGG interface", the
> input and return types of the new proposed UDAGG functions are dynamically
> given by the users ("[user defined xxx inputs/types]"). All embed built-in
> functions for this new API have to be generated via codegen. I will update
> Jira and doc.
>
> Thanks,
> Shaoxuan
>
>
> On Sat, Jan 21, 2017 at 7:29 AM, Fabian Hueske <[hidden email]> wrote:
>
> > Hi Shaoxuan,
> >
> > thanks a lot for this great design doc.
> > I think user defined aggregation functions are a very important feature
> for
> > the Table API and SQL.
> >
> > Have you thought about how the aggregation functions will be embedded in
> > Flink functions?
> > At the moment, we have a generic Flink function which is configured with
> > aggregation functions, i.e., we do not leverage code generation here.
> > Do you plan to embed built-in and user-defined aggregations functions
> that
> > implement the proposed API with code generation?
> >
> > Can you maybe extend the JIRA or design document with this information?
> >
> > Thank you,
> > Fabian
> >
> > 2017-01-18 20:55 GMT+01:00 Shaoxuan Wang <[hidden email]>:
> >
> > > Hi everyone,
> > > I have drafted the design doc (link is provided below) for UDAGG, and
> > > created the JIRA (FLINK-5564) to track the progress of this design.
> > > Special thanks to Stephan and Fabian for their advice and help.
> > >
> > > Please check the design doc, feel free to share your comments in the
> > google
> > > doc:
> > > https://docs.google.com/document/d/19JXK8jLIi8IqV9yf7hOs_Oz6
> > > 7yXOypY7Uh5gIOK2r-U/edit
> > >
> > > Regards,
> > > Shaoxuan
> > >
> > > On Wed, Jan 11, 2017 at 6:09 AM, Fabian Hueske <[hidden email]>
> > wrote:
> > >
> > > > Hi Shaoxuan,
> > > >
> > > > user-defined aggregates would be a great addition to the Table API /
> > SQL.
> > > > I completely agree that the current (internal) interface is not well
> > > suited
> > > > as an external interface and needs to be redesigned if exposed to
> > users.
> > > >
> > > > We need to careful think about this new interface and how we can
> > > integrate
> > > > it with the DataStream (and DataSet) API to support all required
> > > > operations, esp. with respect to null aggregates and support for
> > > combining
> > > > / merging.
> > > > I agree that for efficient execution, we should avoid WindowFunctions
> > > > (large state) and FoldFunction (not mergeable). If we need a new
> > > interface
> > > > in the DataStream API, we need to discuss this in more detail.
> > > > I think we need a bit more information about the proposed UDAGG
> > interface
> > > > to discuss how this can be mapped to DataStream operators.
> > > >
> > > > Support for retraction will be required for our future plans with the
> > > > streaming Table API / SQL interface.
> > > >
> > > > Looking forward to your proposal,
> > > > Fabian
> > > >
> > > > 2017-01-10 15:40 GMT+01:00 Shaoxuan Wang <[hidden email]>:
> > > >
> > > > > Hello everyone,
> > > > >
> > > > > I am writing this email to propose a new User Defined Aggregate
> > > > interface.
> > > > > We were trying to leverage the existing Aggregate interface, but
> > > > > unfortunately we realized that it is not sufficient to meet all our
> > > > needs.
> > > > > Here are the obstacles we have observed:
> > > > > 1) The current aggregate interface is not very concise to users.
> One
> > > > needs
> > > > > to know the design details of the intermediate Row buffer before
> > > > implements
> > > > > an Aggregate. Seven functions are needed even for a simple Count
> > > > aggregate.
> > > > > We'd better to make the UDAGG interface much more concisely.
> > > > > 2) the current aggregate function can be only applied on one single
> > > > column.
> > > > > There are many scenarios which require the aggregate function
> taking
> > > > > multiple columns as the inputs.
> > > > > 3) “Retraction” is not covered in the current Aggregate.
> > > > >
> > > > > For #1, I am thinking instead of letting users to manipulate the
> > > > > intermediate buffer, we could potentially put the entire Aggregate
> > > > instance
> > > > > or a subclass instance of Aggregate to the Row buffer, such that
> the
> > > user
> > > > > does not need to know how the Aggregate state is maintained by the
> > > > > framework.
> > > > > But to achieve this goal, we probably need a new dataStream API.
> The
> > > > > existing reduce API does not work with two different types of
> inputs
> > > (in
> > > > > this proposal, the inputs will be upstream values, and the instance
> > of
> > > > the
> > > > > current accumulated Aggregate), while the fold API is not able to
> > merge
> > > > the
> > > > > two Aggregate results (which is usually needed for merging two
> > session
> > > > > windows).
> > > > >
> > > > > For #3, besides the aggregate itself, there are a few other things
> > need
> > > > to
> > > > > be taken care of to fully support the retractions. I will share a
> > > > separate
> > > > > concrete proposal about how to generate and process retractions,
> and
> > > how
> > > > it
> > > > > works along with this new proposed UDAGG.
> > > > >
> > > > > I would like really appreciate if you can share your opinions on
> this
> > > > > proposal, especially for the needed dataStream API for #1. Also, if
> > > there
> > > > > is any other good things you think to be better added for UDAGG,
> > please
> > > > > feel free to share with us. I will draft my proposal in a google
> doc
> > > and
> > > > > share to the flink DEV group very soon.
> > > > >
> > > > > Thanks,
> > > > > Shaoxuan
> > > > >
> > > >
> > >
> >
>