(DEPRECATED) Apache Flink Mailing List archive.

Support distinct aggregation over data stream on Table/SQL API

Rong Rong

Support distinct aggregation over data stream on Table/SQL API

Hi Community,

We are working on support for distinct aggregations over data streams in
the Table/SQL API. There currently seem to be many pending JIRAs related to
distinct aggregation over streams (FLINK-6249
<https://issues.apache.org/jira/browse/FLINK-6249>, FLINK-6260
<https://issues.apache.org/jira/browse/FLINK-6260>, FLINK-5315
<https://issues.apache.org/jira/browse/FLINK-5315>, FLINK-6335
<https://issues.apache.org/jira/browse/FLINK-6335>, FLINK-6373
<https://issues.apache.org/jira/browse/FLINK-6373>, FLINK-6250
<https://issues.apache.org/jira/browse/FLINK-6250>, etc.), and I have some
concerns about settling on a solution, as there might be other use cases
out there.

I put together a write-up that categorizes the use cases into unbounded and
bounded aggregations and proposes a solution based on modifying existing
aggregate functions and adding new distinct ones using the UDAGG API with
DataView. Please find it here
<https://docs.google.com/document/d/1zj6OA-K2hi7ah8Fo-xTQB-mVmYfm6LsN2_NHgTCVmJI/edit?usp=sharing>
.
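
As a rough illustration only (not taken from the design doc), the general
idea of a distinct aggregate that keeps its seen values in a map could be
sketched in plain Python as below. The class and method names are
hypothetical, the dict merely stands in for a state-backed map such as a
map-style DataView, and no Flink API is used. The retract method is an
assumption about what bounded (windowed) aggregations would need when
rows leave the window.

```python
class DistinctCountAccumulator:
    """Hypothetical accumulator for COUNT(DISTINCT x) with retraction support."""

    def __init__(self):
        # Assumption: a plain dict stands in for a state-backed map view.
        # It maps each value to the number of times it is currently present.
        self.value_counts = {}
        self.distinct_count = 0

    def accumulate(self, value):
        """Add one occurrence of value; count it only the first time."""
        seen = self.value_counts.get(value, 0)
        if seen == 0:
            self.distinct_count += 1
        self.value_counts[value] = seen + 1

    def retract(self, value):
        """Remove one occurrence of value, e.g. when a row leaves a window."""
        seen = self.value_counts.get(value, 0)
        if seen == 0:
            return  # value was never accumulated; nothing to retract
        if seen == 1:
            # Last occurrence is gone, so the value is no longer distinct.
            del self.value_counts[value]
            self.distinct_count -= 1
        else:
            self.value_counts[value] = seen - 1

    def get_value(self):
        """Return the current number of distinct values."""
        return self.distinct_count


acc = DistinctCountAccumulator()
for v in ["a", "b", "a"]:
    acc.accumulate(v)
print(acc.get_value())  # 2

acc.retract("a")
print(acc.get_value())  # 2, one "a" is still present

acc.retract("a")
print(acc.get_value())  # 1
```

Keeping a per-value count rather than a plain set is what makes retraction
possible: a value stops counting as distinct only when its last occurrence
is removed.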

Any comments or suggestions are highly appreciated.

Many Thanks,
Rong

Fabian Hueske-2

Re: Support distinct aggregation over data stream on Table/SQL API

Hi Rong,

Thanks for taking the initiative to improve the support for DISTINCT
aggregations!
I've made a pass over your design document and left a couple of comments.
I think it is a really good write up and serves as a good start.

IMO, the next steps could be to
1) continue and finalize the discussion on the design doc. Feel free to
open a new umbrella JIRA and link your doc there.
2) check which JIRAs are still relevant. Close or reorganize them according
to the plan in your design doc and make them subissues of the umbrella
issue.
3) add support for DISTINCT in SQL
4) later extend the Table API to also support distinct aggregations
(this would be mostly API changes, since the execution will already be
solved by then)

Let me know what you think.

Best, Fabian

In reply to Rong Rong's message of 2018-02-14 3:07 GMT+01:00.

Rong Rong

Re: Support distinct aggregation over data stream on Table/SQL API

Thanks Fabian for the review,

I will incorporate the feedback, finalize the design doc, and open a
JIRA to track all sub-tasks.
Please also feel free to comment if there are any other related DISTINCT
aggregation use cases not covered by the design doc.

One higher-level question regarding #4: should the Table API's
functionality always be a superset of the SQL API's?
I have seen some features that are available in the Table API but not in
SQL, and I was wondering whether that is a must-obey rule during
development.

--
Rong

In reply to Fabian Hueske's message of Wed, Feb 14, 2018 at 2:32 AM.

Fabian Hueske-2

Re: Support distinct aggregation over data stream on Table/SQL API

Hi Rong,

Thanks for the update!
Please suggest JIRAs to close (or close them yourself if possible) if they
are covered by the ones that you create.

At the moment, we aim for feature parity between SQL and Table API.
So ideally all features are available in both APIs. This is usually not too
complicated, because they have the same internal representation (a Calcite
RelNode tree).
A path that we often take is to start implementing a feature for SQL and
add the missing API and translation step to the Table API afterwards.
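
To make the shared-representation point concrete, here is a toy sketch in
plain Python (not Flink or Calcite code). Every name in it is hypothetical:
DistinctAggregate stands in for a logical plan node, plan_from_sql for a
SQL front end that understands exactly one query shape, and TableBuilder
for a fluent API. Both front ends translate to the same node, so whatever
executes that node would only need to be implemented once.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class DistinctAggregate:
    """Hypothetical logical plan node shared by both front ends."""
    table: str
    group_key: str
    function: str
    column: str


def plan_from_sql(query: str) -> DistinctAggregate:
    """Toy SQL front end: translates one fixed query shape into the plan node."""
    pattern = (
        r"SELECT\s+(\w+)\s*,\s*(\w+)\(DISTINCT\s+(\w+)\)\s+"
        r"FROM\s+(\w+)\s+GROUP\s+BY\s+(\w+)"
    )
    match = re.fullmatch(pattern, query.strip(), flags=re.IGNORECASE)
    if match is None:
        raise ValueError("unsupported query shape")
    select_key, function, column, table, group_key = match.groups()
    if select_key != group_key:
        raise ValueError("selected key must match the GROUP BY key")
    return DistinctAggregate(table, group_key, function.upper(), column)


class TableBuilder:
    """Toy fluent front end that builds the same plan node without SQL."""

    def __init__(self, table: str):
        self.table = table
        self.group_key = None

    def group_by(self, key: str) -> "TableBuilder":
        self.group_key = key
        return self

    def count_distinct(self, column: str) -> DistinctAggregate:
        return DistinctAggregate(self.table, self.group_key, "COUNT", column)


sql_plan = plan_from_sql(
    "SELECT user_id, COUNT(DISTINCT item) FROM clicks GROUP BY user_id"
)
api_plan = TableBuilder("clicks").group_by("user_id").count_distinct("item")

# Both front ends produce an identical plan node.
print(sql_plan == api_plan)  # True
```

In this sketch, adding the feature to the second front end amounts to one
extra translation method, which mirrors the "SQL first, Table API
afterwards" path described above.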

In the long run, the Table API might have some shortcuts for features that
are hard to express in SQL, but we are not there yet.

Best, Fabian

In reply to Rong Rong's message of 2018-02-15 20:06 GMT+01:00.