Hello everyone,
I am writing this email to propose a new User Defined Aggregate interface. We were trying to leverage the existing Aggregate interface, but unfortunately we realized that it is not sufficient to meet all our needs. Here are the obstacles we have observed: 1) The current aggregate interface is not very concise to users. One needs to know the design details of the intermediate Row buffer before implements an Aggregate. Seven functions are needed even for a simple Count aggregate. We'd better to make the UDAGG interface much more concisely. 2) the current aggregate function can be only applied on one single column. There are many scenarios which require the aggregate function taking multiple columns as the inputs. 3) “Retraction” is not covered in the current Aggregate. For #1, I am thinking instead of letting users to manipulate the intermediate buffer, we could potentially put the entire Aggregate instance or a subclass instance of Aggregate to the Row buffer, such that the user does not need to know how the Aggregate state is maintained by the framework. But to achieve this goal, we probably need a new dataStream API. The existing reduce API does not work with two different types of inputs (in this proposal, the inputs will be upstream values, and the instance of the current accumulated Aggregate), while the fold API is not able to merge the two Aggregate results (which is usually needed for merging two session windows). For #3, besides the aggregate itself, there are a few other things need to be taken care of to fully support the retractions. I will share a separate concrete proposal about how to generate and process retractions, and how it works along with this new proposed UDAGG. I would like really appreciate if you can share your opinions on this proposal, especially for the needed dataStream API for #1. Also, if there is any other good things you think to be better added for UDAGG, please feel free to share with us. I will draft my proposal in a google doc and share to the flink DEV group very soon. Thanks, Shaoxuan |
Hi Shaoxuan,
user-defined aggregates would be a great addition to the Table API / SQL. I completely agree that the current (internal) interface is not well suited as an external interface and needs to be redesigned if exposed to users. We need to careful think about this new interface and how we can integrate it with the DataStream (and DataSet) API to support all required operations, esp. with respect to null aggregates and support for combining / merging. I agree that for efficient execution, we should avoid WindowFunctions (large state) and FoldFunction (not mergeable). If we need a new interface in the DataStream API, we need to discuss this in more detail. I think we need a bit more information about the proposed UDAGG interface to discuss how this can be mapped to DataStream operators. Support for retraction will be required for our future plans with the streaming Table API / SQL interface. Looking forward to your proposal, Fabian 2017-01-10 15:40 GMT+01:00 Shaoxuan Wang <[hidden email]>: > Hello everyone, > > I am writing this email to propose a new User Defined Aggregate interface. > We were trying to leverage the existing Aggregate interface, but > unfortunately we realized that it is not sufficient to meet all our needs. > Here are the obstacles we have observed: > 1) The current aggregate interface is not very concise to users. One needs > to know the design details of the intermediate Row buffer before implements > an Aggregate. Seven functions are needed even for a simple Count aggregate. > We'd better to make the UDAGG interface much more concisely. > 2) the current aggregate function can be only applied on one single column. > There are many scenarios which require the aggregate function taking > multiple columns as the inputs. > 3) “Retraction” is not covered in the current Aggregate. > > For #1, I am thinking instead of letting users to manipulate the > intermediate buffer, we could potentially put the entire Aggregate instance > or a subclass instance of Aggregate to the Row buffer, such that the user > does not need to know how the Aggregate state is maintained by the > framework. > But to achieve this goal, we probably need a new dataStream API. The > existing reduce API does not work with two different types of inputs (in > this proposal, the inputs will be upstream values, and the instance of the > current accumulated Aggregate), while the fold API is not able to merge the > two Aggregate results (which is usually needed for merging two session > windows). > > For #3, besides the aggregate itself, there are a few other things need to > be taken care of to fully support the retractions. I will share a separate > concrete proposal about how to generate and process retractions, and how it > works along with this new proposed UDAGG. > > I would like really appreciate if you can share your opinions on this > proposal, especially for the needed dataStream API for #1. Also, if there > is any other good things you think to be better added for UDAGG, please > feel free to share with us. I will draft my proposal in a google doc and > share to the flink DEV group very soon. > > Thanks, > Shaoxuan > |
Hi everyone,
I have drafted the design doc (link is provided below) for UDAGG, and created the JIRA (FLINK-5564) to track the progress of this design. Special thanks to Stephan and Fabian for their advice and help. Please check the design doc, feel free to share your comments in the google doc: https://docs.google.com/document/d/19JXK8jLIi8IqV9yf7hOs_Oz67yXOypY7Uh5gIOK2r-U/edit Regards, Shaoxuan On Wed, Jan 11, 2017 at 6:09 AM, Fabian Hueske <[hidden email]> wrote: > Hi Shaoxuan, > > user-defined aggregates would be a great addition to the Table API / SQL. > I completely agree that the current (internal) interface is not well suited > as an external interface and needs to be redesigned if exposed to users. > > We need to careful think about this new interface and how we can integrate > it with the DataStream (and DataSet) API to support all required > operations, esp. with respect to null aggregates and support for combining > / merging. > I agree that for efficient execution, we should avoid WindowFunctions > (large state) and FoldFunction (not mergeable). If we need a new interface > in the DataStream API, we need to discuss this in more detail. > I think we need a bit more information about the proposed UDAGG interface > to discuss how this can be mapped to DataStream operators. > > Support for retraction will be required for our future plans with the > streaming Table API / SQL interface. > > Looking forward to your proposal, > Fabian > > 2017-01-10 15:40 GMT+01:00 Shaoxuan Wang <[hidden email]>: > > > Hello everyone, > > > > I am writing this email to propose a new User Defined Aggregate > interface. > > We were trying to leverage the existing Aggregate interface, but > > unfortunately we realized that it is not sufficient to meet all our > needs. > > Here are the obstacles we have observed: > > 1) The current aggregate interface is not very concise to users. One > needs > > to know the design details of the intermediate Row buffer before > implements > > an Aggregate. Seven functions are needed even for a simple Count > aggregate. > > We'd better to make the UDAGG interface much more concisely. > > 2) the current aggregate function can be only applied on one single > column. > > There are many scenarios which require the aggregate function taking > > multiple columns as the inputs. > > 3) “Retraction” is not covered in the current Aggregate. > > > > For #1, I am thinking instead of letting users to manipulate the > > intermediate buffer, we could potentially put the entire Aggregate > instance > > or a subclass instance of Aggregate to the Row buffer, such that the user > > does not need to know how the Aggregate state is maintained by the > > framework. > > But to achieve this goal, we probably need a new dataStream API. The > > existing reduce API does not work with two different types of inputs (in > > this proposal, the inputs will be upstream values, and the instance of > the > > current accumulated Aggregate), while the fold API is not able to merge > the > > two Aggregate results (which is usually needed for merging two session > > windows). > > > > For #3, besides the aggregate itself, there are a few other things need > to > > be taken care of to fully support the retractions. I will share a > separate > > concrete proposal about how to generate and process retractions, and how > it > > works along with this new proposed UDAGG. > > > > I would like really appreciate if you can share your opinions on this > > proposal, especially for the needed dataStream API for #1. Also, if there > > is any other good things you think to be better added for UDAGG, please > > feel free to share with us. I will draft my proposal in a google doc and > > share to the flink DEV group very soon. > > > > Thanks, > > Shaoxuan > > > |
Hi Shaoxuan,
thanks a lot for this great design doc. I think user defined aggregation functions are a very important feature for the Table API and SQL. Have you thought about how the aggregation functions will be embedded in Flink functions? At the moment, we have a generic Flink function which is configured with aggregation functions, i.e., we do not leverage code generation here. Do you plan to embed built-in and user-defined aggregations functions that implement the proposed API with code generation? Can you maybe extend the JIRA or design document with this information? Thank you, Fabian 2017-01-18 20:55 GMT+01:00 Shaoxuan Wang <[hidden email]>: > Hi everyone, > I have drafted the design doc (link is provided below) for UDAGG, and > created the JIRA (FLINK-5564) to track the progress of this design. > Special thanks to Stephan and Fabian for their advice and help. > > Please check the design doc, feel free to share your comments in the google > doc: > https://docs.google.com/document/d/19JXK8jLIi8IqV9yf7hOs_Oz6 > 7yXOypY7Uh5gIOK2r-U/edit > > Regards, > Shaoxuan > > On Wed, Jan 11, 2017 at 6:09 AM, Fabian Hueske <[hidden email]> wrote: > > > Hi Shaoxuan, > > > > user-defined aggregates would be a great addition to the Table API / SQL. > > I completely agree that the current (internal) interface is not well > suited > > as an external interface and needs to be redesigned if exposed to users. > > > > We need to careful think about this new interface and how we can > integrate > > it with the DataStream (and DataSet) API to support all required > > operations, esp. with respect to null aggregates and support for > combining > > / merging. > > I agree that for efficient execution, we should avoid WindowFunctions > > (large state) and FoldFunction (not mergeable). If we need a new > interface > > in the DataStream API, we need to discuss this in more detail. > > I think we need a bit more information about the proposed UDAGG interface > > to discuss how this can be mapped to DataStream operators. > > > > Support for retraction will be required for our future plans with the > > streaming Table API / SQL interface. > > > > Looking forward to your proposal, > > Fabian > > > > 2017-01-10 15:40 GMT+01:00 Shaoxuan Wang <[hidden email]>: > > > > > Hello everyone, > > > > > > I am writing this email to propose a new User Defined Aggregate > > interface. > > > We were trying to leverage the existing Aggregate interface, but > > > unfortunately we realized that it is not sufficient to meet all our > > needs. > > > Here are the obstacles we have observed: > > > 1) The current aggregate interface is not very concise to users. One > > needs > > > to know the design details of the intermediate Row buffer before > > implements > > > an Aggregate. Seven functions are needed even for a simple Count > > aggregate. > > > We'd better to make the UDAGG interface much more concisely. > > > 2) the current aggregate function can be only applied on one single > > column. > > > There are many scenarios which require the aggregate function taking > > > multiple columns as the inputs. > > > 3) “Retraction” is not covered in the current Aggregate. > > > > > > For #1, I am thinking instead of letting users to manipulate the > > > intermediate buffer, we could potentially put the entire Aggregate > > instance > > > or a subclass instance of Aggregate to the Row buffer, such that the > user > > > does not need to know how the Aggregate state is maintained by the > > > framework. > > > But to achieve this goal, we probably need a new dataStream API. The > > > existing reduce API does not work with two different types of inputs > (in > > > this proposal, the inputs will be upstream values, and the instance of > > the > > > current accumulated Aggregate), while the fold API is not able to merge > > the > > > two Aggregate results (which is usually needed for merging two session > > > windows). > > > > > > For #3, besides the aggregate itself, there are a few other things need > > to > > > be taken care of to fully support the retractions. I will share a > > separate > > > concrete proposal about how to generate and process retractions, and > how > > it > > > works along with this new proposed UDAGG. > > > > > > I would like really appreciate if you can share your opinions on this > > > proposal, especially for the needed dataStream API for #1. Also, if > there > > > is any other good things you think to be better added for UDAGG, please > > > feel free to share with us. I will draft my proposal in a google doc > and > > > share to the flink DEV group very soon. > > > > > > Thanks, > > > Shaoxuan > > > > > > |
Hi Fabian,
Thanks for the carefully checking on the proposal. Yes, code generation is in my plan. As shown in "2.3 UDAGG interface", the input and return types of the new proposed UDAGG functions are dynamically given by the users ("[user defined xxx inputs/types]"). All embed built-in functions for this new API have to be generated via codegen. I will update Jira and doc. Thanks, Shaoxuan On Sat, Jan 21, 2017 at 7:29 AM, Fabian Hueske <[hidden email]> wrote: > Hi Shaoxuan, > > thanks a lot for this great design doc. > I think user defined aggregation functions are a very important feature for > the Table API and SQL. > > Have you thought about how the aggregation functions will be embedded in > Flink functions? > At the moment, we have a generic Flink function which is configured with > aggregation functions, i.e., we do not leverage code generation here. > Do you plan to embed built-in and user-defined aggregations functions that > implement the proposed API with code generation? > > Can you maybe extend the JIRA or design document with this information? > > Thank you, > Fabian > > 2017-01-18 20:55 GMT+01:00 Shaoxuan Wang <[hidden email]>: > > > Hi everyone, > > I have drafted the design doc (link is provided below) for UDAGG, and > > created the JIRA (FLINK-5564) to track the progress of this design. > > Special thanks to Stephan and Fabian for their advice and help. > > > > Please check the design doc, feel free to share your comments in the > > doc: > > https://docs.google.com/document/d/19JXK8jLIi8IqV9yf7hOs_Oz6 > > 7yXOypY7Uh5gIOK2r-U/edit > > > > Regards, > > Shaoxuan > > > > On Wed, Jan 11, 2017 at 6:09 AM, Fabian Hueske <[hidden email]> > wrote: > > > > > Hi Shaoxuan, > > > > > > user-defined aggregates would be a great addition to the Table API / > SQL. > > > I completely agree that the current (internal) interface is not well > > suited > > > as an external interface and needs to be redesigned if exposed to > users. > > > > > > We need to careful think about this new interface and how we can > > integrate > > > it with the DataStream (and DataSet) API to support all required > > > operations, esp. with respect to null aggregates and support for > > combining > > > / merging. > > > I agree that for efficient execution, we should avoid WindowFunctions > > > (large state) and FoldFunction (not mergeable). If we need a new > > interface > > > in the DataStream API, we need to discuss this in more detail. > > > I think we need a bit more information about the proposed UDAGG > interface > > > to discuss how this can be mapped to DataStream operators. > > > > > > Support for retraction will be required for our future plans with the > > > streaming Table API / SQL interface. > > > > > > Looking forward to your proposal, > > > Fabian > > > > > > 2017-01-10 15:40 GMT+01:00 Shaoxuan Wang <[hidden email]>: > > > > > > > Hello everyone, > > > > > > > > I am writing this email to propose a new User Defined Aggregate > > > interface. > > > > We were trying to leverage the existing Aggregate interface, but > > > > unfortunately we realized that it is not sufficient to meet all our > > > needs. > > > > Here are the obstacles we have observed: > > > > 1) The current aggregate interface is not very concise to users. One > > > needs > > > > to know the design details of the intermediate Row buffer before > > > implements > > > > an Aggregate. Seven functions are needed even for a simple Count > > > aggregate. > > > > We'd better to make the UDAGG interface much more concisely. > > > > 2) the current aggregate function can be only applied on one single > > > column. > > > > There are many scenarios which require the aggregate function taking > > > > multiple columns as the inputs. > > > > 3) “Retraction” is not covered in the current Aggregate. > > > > > > > > For #1, I am thinking instead of letting users to manipulate the > > > > intermediate buffer, we could potentially put the entire Aggregate > > > instance > > > > or a subclass instance of Aggregate to the Row buffer, such that the > > user > > > > does not need to know how the Aggregate state is maintained by the > > > > framework. > > > > But to achieve this goal, we probably need a new dataStream API. The > > > > existing reduce API does not work with two different types of inputs > > (in > > > > this proposal, the inputs will be upstream values, and the instance > of > > > the > > > > current accumulated Aggregate), while the fold API is not able to > merge > > > the > > > > two Aggregate results (which is usually needed for merging two > session > > > > windows). > > > > > > > > For #3, besides the aggregate itself, there are a few other things > need > > > to > > > > be taken care of to fully support the retractions. I will share a > > > separate > > > > concrete proposal about how to generate and process retractions, and > > how > > > it > > > > works along with this new proposed UDAGG. > > > > > > > > I would like really appreciate if you can share your opinions on this > > > > proposal, especially for the needed dataStream API for #1. Also, if > > there > > > > is any other good things you think to be better added for UDAGG, > please > > > > feel free to share with us. I will draft my proposal in a google doc > > and > > > > share to the flink DEV group very soon. > > > > > > > > Thanks, > > > > Shaoxuan > > > > > > > > > > |
Thanks for the clarification Shaoxuan.
Cheers, Fabian 2017-01-22 4:08 GMT+01:00 Shaoxuan Wang <[hidden email]>: > Hi Fabian, > Thanks for the carefully checking on the proposal. > Yes, code generation is in my plan. As shown in "2.3 UDAGG interface", the > input and return types of the new proposed UDAGG functions are dynamically > given by the users ("[user defined xxx inputs/types]"). All embed built-in > functions for this new API have to be generated via codegen. I will update > Jira and doc. > > Thanks, > Shaoxuan > > > On Sat, Jan 21, 2017 at 7:29 AM, Fabian Hueske <[hidden email]> wrote: > > > Hi Shaoxuan, > > > > thanks a lot for this great design doc. > > I think user defined aggregation functions are a very important feature > for > > the Table API and SQL. > > > > Have you thought about how the aggregation functions will be embedded in > > Flink functions? > > At the moment, we have a generic Flink function which is configured with > > aggregation functions, i.e., we do not leverage code generation here. > > Do you plan to embed built-in and user-defined aggregations functions > that > > implement the proposed API with code generation? > > > > Can you maybe extend the JIRA or design document with this information? > > > > Thank you, > > Fabian > > > > 2017-01-18 20:55 GMT+01:00 Shaoxuan Wang <[hidden email]>: > > > > > Hi everyone, > > > I have drafted the design doc (link is provided below) for UDAGG, and > > > created the JIRA (FLINK-5564) to track the progress of this design. > > > Special thanks to Stephan and Fabian for their advice and help. > > > > > > Please check the design doc, feel free to share your comments in the > > > doc: > > > https://docs.google.com/document/d/19JXK8jLIi8IqV9yf7hOs_Oz6 > > > 7yXOypY7Uh5gIOK2r-U/edit > > > > > > Regards, > > > Shaoxuan > > > > > > On Wed, Jan 11, 2017 at 6:09 AM, Fabian Hueske <[hidden email]> > > wrote: > > > > > > > Hi Shaoxuan, > > > > > > > > user-defined aggregates would be a great addition to the Table API / > > SQL. > > > > I completely agree that the current (internal) interface is not well > > > suited > > > > as an external interface and needs to be redesigned if exposed to > > users. > > > > > > > > We need to careful think about this new interface and how we can > > > integrate > > > > it with the DataStream (and DataSet) API to support all required > > > > operations, esp. with respect to null aggregates and support for > > > combining > > > > / merging. > > > > I agree that for efficient execution, we should avoid WindowFunctions > > > > (large state) and FoldFunction (not mergeable). If we need a new > > > interface > > > > in the DataStream API, we need to discuss this in more detail. > > > > I think we need a bit more information about the proposed UDAGG > > interface > > > > to discuss how this can be mapped to DataStream operators. > > > > > > > > Support for retraction will be required for our future plans with the > > > > streaming Table API / SQL interface. > > > > > > > > Looking forward to your proposal, > > > > Fabian > > > > > > > > 2017-01-10 15:40 GMT+01:00 Shaoxuan Wang <[hidden email]>: > > > > > > > > > Hello everyone, > > > > > > > > > > I am writing this email to propose a new User Defined Aggregate > > > > interface. > > > > > We were trying to leverage the existing Aggregate interface, but > > > > > unfortunately we realized that it is not sufficient to meet all our > > > > needs. > > > > > Here are the obstacles we have observed: > > > > > 1) The current aggregate interface is not very concise to users. > One > > > > needs > > > > > to know the design details of the intermediate Row buffer before > > > > implements > > > > > an Aggregate. Seven functions are needed even for a simple Count > > > > aggregate. > > > > > We'd better to make the UDAGG interface much more concisely. > > > > > 2) the current aggregate function can be only applied on one single > > > > column. > > > > > There are many scenarios which require the aggregate function > taking > > > > > multiple columns as the inputs. > > > > > 3) “Retraction” is not covered in the current Aggregate. > > > > > > > > > > For #1, I am thinking instead of letting users to manipulate the > > > > > intermediate buffer, we could potentially put the entire Aggregate > > > > instance > > > > > or a subclass instance of Aggregate to the Row buffer, such that > the > > > user > > > > > does not need to know how the Aggregate state is maintained by the > > > > > framework. > > > > > But to achieve this goal, we probably need a new dataStream API. > The > > > > > existing reduce API does not work with two different types of > inputs > > > (in > > > > > this proposal, the inputs will be upstream values, and the instance > > of > > > > the > > > > > current accumulated Aggregate), while the fold API is not able to > > merge > > > > the > > > > > two Aggregate results (which is usually needed for merging two > > session > > > > > windows). > > > > > > > > > > For #3, besides the aggregate itself, there are a few other things > > need > > > > to > > > > > be taken care of to fully support the retractions. I will share a > > > > separate > > > > > concrete proposal about how to generate and process retractions, > and > > > how > > > > it > > > > > works along with this new proposed UDAGG. > > > > > > > > > > I would like really appreciate if you can share your opinions on > this > > > > > proposal, especially for the needed dataStream API for #1. Also, if > > > there > > > > > is any other good things you think to be better added for UDAGG, > > please > > > > > feel free to share with us. I will draft my proposal in a google > doc > > > and > > > > > share to the flink DEV group very soon. > > > > > > > > > > Thanks, > > > > > Shaoxuan > > > > > > > > > > > > > > > |
Free forum by Nabble | Edit this page |