Hi everyone,
I want to start a FLIP about supporting windowing table-valued functions (TVF). The main purpose of this FLIP is to improve the near real-time (NRT) experience of Flink. FLIP-145: https://cwiki.apache.org/confluence/display/FLINK/FLIP-145%3A+Support+SQL+windowing+table-valued+function We want to introduce TUMBLE, HOP, CUMULATE windowing TVFs, the CUMULATE is a new kind of window. With the windowing TVFs, we can support richer operations on windows, including window join, window TopN and so on. This makes things simple: we only need to assign windows at the beginning of the query, and then apply operations after that like traditional batch SQL. We hope it can help to reduce the learning curve of windows, improve NRT for Flink, and attract more batch users. A simple code snippet for 10 minutes tumbling window aggregate: SELECT window_start, window_end, SUM(price) FROM TABLE( TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES)) GROUP BY window_start, window_end; I'm looking forward to your feedback. Best, Jark |
Hi, Jark,
I'm very interested in this feature, and I'm also working on this recently. I just have a glance at the FLIP, it's good, but I found that there is no plan to add SESSION windows. Also, I think there can be more things we can do based on this new syntax. For example, - window sort support - window union/intersect/minus support - Improve dimension table join We can have more deep discussion on this new feature later . I've also opened an jira that is related to this feature recently: https://issues.apache.org/jira/browse/FLINK-18830 Best! PengchengLiu 在 2020/9/25 下午10:30,“Jark Wu”<[hidden email]> 写入: Hi everyone, I want to start a FLIP about supporting windowing table-valued functions (TVF). The main purpose of this FLIP is to improve the near real-time (NRT) experience of Flink. FLIP-145: https://cwiki.apache.org/confluence/display/FLINK/FLIP-145%3A+Support+SQL+windowing+table-valued+function We want to introduce TUMBLE, HOP, CUMULATE windowing TVFs, the CUMULATE is a new kind of window. With the windowing TVFs, we can support richer operations on windows, including window join, window TopN and so on. This makes things simple: we only need to assign windows at the beginning of the query, and then apply operations after that like traditional batch SQL. We hope it can help to reduce the learning curve of windows, improve NRT for Flink, and attract more batch users. A simple code snippet for 10 minutes tumbling window aggregate: SELECT window_start, window_end, SUM(price) FROM TABLE( TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES)) GROUP BY window_start, window_end; I'm looking forward to your feedback. Best, Jark |
Hi pengcheng,
That's great to see you also have the need of window join. You are right, the windowing TVF is a powerful feature which can support more operations in the future. I think it as of the date time "partition" selection in batch SQL jobs, with this new syntax, I think it is possible to migrate traditional batch SQL jobs to Flink SQL by changing a few lines. Regarding the SESSION window, this is on purpose to keep it out of the FLIP, because we want to keep the FLIP small to catch up 1.12 and SESSION TVF is rarely useful (e.g. session window join?). Best, Jark On Fri, 25 Sep 2020 at 22:59, liupengcheng <[hidden email]> wrote: > Hi, Jark, > I'm very interested in this feature, and I'm also working on this > recently. > I just have a glance at the FLIP, it's good, but I found that > there is no plan to add SESSION windows. > Also, I think there can be more things we can do based on this new > syntax. For example, > - window sort support > - window union/intersect/minus support > - Improve dimension table join > We can have more deep discussion on this new feature later . > I've also opened an jira that is related to this feature recently: > https://issues.apache.org/jira/browse/FLINK-18830 > > Best! > PengchengLiu > > 在 2020/9/25 下午10:30,“Jark Wu”<[hidden email]> 写入: > > Hi everyone, > > I want to start a FLIP about supporting windowing table-valued > functions > (TVF). > The main purpose of this FLIP is to improve the near real-time (NRT) > experience of Flink. > > FLIP-145: > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-145%3A+Support+SQL+windowing+table-valued+function > > We want to introduce TUMBLE, HOP, CUMULATE windowing TVFs, the > CUMULATE is > a new kind of window. > With the windowing TVFs, we can support richer operations on windows, > including window join, window TopN and so on. > This makes things simple: we only need to assign windows at the > beginning > of the query, and then apply operations after that like traditional > batch > SQL. > We hope it can help to reduce the learning curve of windows, improve > NRT > for Flink, and attract more batch users. > > A simple code snippet for 10 minutes tumbling window aggregate: > > SELECT window_start, window_end, SUM(price) > FROM TABLE( > TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES)) > GROUP BY window_start, window_end; > > I'm looking forward to your feedback. > > Best, > Jark > > > |
Hi Jark,
Thanks for reply, yes, I think it's a good feature, it can improve the NRT scenarios as you mentioned in the FLIP. Also, I think it can improve the streaming SQL greatly, it can support richer window operations in flink SQL and bring great convenience to users. (we are now only supported group window in flink). Regarding the SESSION window, I think it's especially useful for user behavior analysis(e.g. counting user visits on a news website or social platform), but I agree that we can keep it out of the FLIP now to catch up 1.12. Recently, I've done some work on the stream planner with the TVFs, and I'm willing to contribute to this part. Is it in the plan of this FLIP? Best, PengchengLiu 在 2020/9/26 下午11:09,“Jark Wu”<[hidden email]> 写入: Hi pengcheng, That's great to see you also have the need of window join. You are right, the windowing TVF is a powerful feature which can support more operations in the future. I think it as of the date time "partition" selection in batch SQL jobs, with this new syntax, I think it is possible to migrate traditional batch SQL jobs to Flink SQL by changing a few lines. Regarding the SESSION window, this is on purpose to keep it out of the FLIP, because we want to keep the FLIP small to catch up 1.12 and SESSION TVF is rarely useful (e.g. session window join?). Best, Jark On Fri, 25 Sep 2020 at 22:59, liupengcheng <[hidden email]> wrote: > Hi, Jark, > I'm very interested in this feature, and I'm also working on this > recently. > I just have a glance at the FLIP, it's good, but I found that > there is no plan to add SESSION windows. > Also, I think there can be more things we can do based on this new > syntax. For example, > - window sort support > - window union/intersect/minus support > - Improve dimension table join > We can have more deep discussion on this new feature later . > I've also opened an jira that is related to this feature recently: > https://issues.apache.org/jira/browse/FLINK-18830 > > Best! > PengchengLiu > > 在 2020/9/25 下午10:30,“Jark Wu”<[hidden email]> 写入: > > Hi everyone, > > I want to start a FLIP about supporting windowing table-valued > functions > (TVF). > The main purpose of this FLIP is to improve the near real-time (NRT) > experience of Flink. > > FLIP-145: > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-145%3A+Support+SQL+windowing+table-valued+function > > We want to introduce TUMBLE, HOP, CUMULATE windowing TVFs, the > CUMULATE is > a new kind of window. > With the windowing TVFs, we can support richer operations on windows, > including window join, window TopN and so on. > This makes things simple: we only need to assign windows at the > beginning > of the query, and then apply operations after that like traditional > batch > SQL. > We hope it can help to reduce the learning curve of windows, improve > NRT > for Flink, and attract more batch users. > > A simple code snippet for 10 minutes tumbling window aggregate: > > SELECT window_start, window_end, SUM(price) > FROM TABLE( > TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES)) > GROUP BY window_start, window_end; > > I'm looking forward to your feedback. > > Best, > Jark > > > |
Hi Pengcheng,
Yes, the window TVF is part of the FLIP. Welcome to contribute and join the discussion. Regarding the SESSION window aggregation, users can use the existing grouped session window function. Best, Jark On Sun, 27 Sep 2020 at 21:24, liupengcheng <[hidden email]> wrote: > Hi Jark, > Thanks for reply, yes, I think it's a good feature, it can improve > the NRT scenarios > as you mentioned in the FLIP. Also, I think it can improve the > streaming SQL greatly, > it can support richer window operations in flink SQL and bring > great convenience to users. > (we are now only supported group window in flink). > > Regarding the SESSION window, I think it's especially useful for > user behavior analysis(e.g. > counting user visits on a news website or social platform), but I > agree that we can keep it > out of the FLIP now to catch up 1.12. > > Recently, I've done some work on the stream planner with the TVFs, > and I'm willing to contribute > to this part. Is it in the plan of this FLIP? > > Best, > PengchengLiu > > > 在 2020/9/26 下午11:09,“Jark Wu”<[hidden email]> 写入: > > Hi pengcheng, > > That's great to see you also have the need of window join. > You are right, the windowing TVF is a powerful feature which can > support > more operations in the future. > I think it as of the date time "partition" selection in batch SQL jobs, > with this new syntax, I think it is possible > to migrate traditional batch SQL jobs to Flink SQL by changing a few > lines. > > Regarding the SESSION window, this is on purpose to keep it out of the > FLIP, because we want to keep the > FLIP small to catch up 1.12 and SESSION TVF is rarely useful (e.g. > session > window join?). > > Best, > Jark > > On Fri, 25 Sep 2020 at 22:59, liupengcheng < > [hidden email]> > wrote: > > > Hi, Jark, > > I'm very interested in this feature, and I'm also working on > this > > recently. > > I just have a glance at the FLIP, it's good, but I found that > > there is no plan to add SESSION windows. > > Also, I think there can be more things we can do based on > this new > > syntax. For example, > > - window sort support > > - window union/intersect/minus support > > - Improve dimension table join > > We can have more deep discussion on this new feature later . > > I've also opened an jira that is related to this feature > recently: > > https://issues.apache.org/jira/browse/FLINK-18830 > > > > Best! > > PengchengLiu > > > > 在 2020/9/25 下午10:30,“Jark Wu”<[hidden email]> 写入: > > > > Hi everyone, > > > > I want to start a FLIP about supporting windowing table-valued > > functions > > (TVF). > > The main purpose of this FLIP is to improve the near real-time > (NRT) > > experience of Flink. > > > > FLIP-145: > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-145%3A+Support+SQL+windowing+table-valued+function > > > > We want to introduce TUMBLE, HOP, CUMULATE windowing TVFs, the > > CUMULATE is > > a new kind of window. > > With the windowing TVFs, we can support richer operations on > windows, > > including window join, window TopN and so on. > > This makes things simple: we only need to assign windows at the > > beginning > > of the query, and then apply operations after that like > traditional > > batch > > SQL. > > We hope it can help to reduce the learning curve of windows, > improve > > NRT > > for Flink, and attract more batch users. > > > > A simple code snippet for 10 minutes tumbling window aggregate: > > > > SELECT window_start, window_end, SUM(price) > > FROM TABLE( > > TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' > MINUTES)) > > GROUP BY window_start, window_end; > > > > I'm looking forward to your feedback. > > > > Best, > > Jark > > > > > > > > > |
Hi everyone,
I have added a section for Performance Optimization to describe how to improve the performance in the short-term and long-term and sketch the future performance potential under the new window API. Introducing the window API is just the first step, we will continuously improve the performance to make it powerful and useful. Best, Jark On Thu, 1 Oct 2020 at 14:28, Jark Wu <[hidden email]> wrote: > Hi Pengcheng, > > Yes, the window TVF is part of the FLIP. Welcome to contribute and join > the discussion. > Regarding the SESSION window aggregation, users can use the existing > grouped session window function. > > Best, > Jark > > On Sun, 27 Sep 2020 at 21:24, liupengcheng <[hidden email]> > wrote: > >> Hi Jark, >> Thanks for reply, yes, I think it's a good feature, it can >> improve the NRT scenarios >> as you mentioned in the FLIP. Also, I think it can improve the >> streaming SQL greatly, >> it can support richer window operations in flink SQL and bring >> great convenience to users. >> (we are now only supported group window in flink). >> >> Regarding the SESSION window, I think it's especially useful for >> user behavior analysis(e.g. >> counting user visits on a news website or social platform), but I >> agree that we can keep it >> out of the FLIP now to catch up 1.12. >> >> Recently, I've done some work on the stream planner with the >> TVFs, and I'm willing to contribute >> to this part. Is it in the plan of this FLIP? >> >> Best, >> PengchengLiu >> >> >> 在 2020/9/26 下午11:09,“Jark Wu”<[hidden email]> 写入: >> >> Hi pengcheng, >> >> That's great to see you also have the need of window join. >> You are right, the windowing TVF is a powerful feature which can >> support >> more operations in the future. >> I think it as of the date time "partition" selection in batch SQL >> jobs, >> with this new syntax, I think it is possible >> to migrate traditional batch SQL jobs to Flink SQL by changing a few >> lines. >> >> Regarding the SESSION window, this is on purpose to keep it out of the >> FLIP, because we want to keep the >> FLIP small to catch up 1.12 and SESSION TVF is rarely useful (e.g. >> session >> window join?). >> >> Best, >> Jark >> >> On Fri, 25 Sep 2020 at 22:59, liupengcheng < >> [hidden email]> >> wrote: >> >> > Hi, Jark, >> > I'm very interested in this feature, and I'm also working >> on this >> > recently. >> > I just have a glance at the FLIP, it's good, but I found >> that >> > there is no plan to add SESSION windows. >> > Also, I think there can be more things we can do based on >> this new >> > syntax. For example, >> > - window sort support >> > - window union/intersect/minus support >> > - Improve dimension table join >> > We can have more deep discussion on this new feature later . >> > I've also opened an jira that is related to this feature >> recently: >> > https://issues.apache.org/jira/browse/FLINK-18830 >> > >> > Best! >> > PengchengLiu >> > >> > 在 2020/9/25 下午10:30,“Jark Wu”<[hidden email]> 写入: >> > >> > Hi everyone, >> > >> > I want to start a FLIP about supporting windowing table-valued >> > functions >> > (TVF). >> > The main purpose of this FLIP is to improve the near real-time >> (NRT) >> > experience of Flink. >> > >> > FLIP-145: >> > >> > >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-145%3A+Support+SQL+windowing+table-valued+function >> > >> > We want to introduce TUMBLE, HOP, CUMULATE windowing TVFs, the >> > CUMULATE is >> > a new kind of window. >> > With the windowing TVFs, we can support richer operations on >> windows, >> > including window join, window TopN and so on. >> > This makes things simple: we only need to assign windows at the >> > beginning >> > of the query, and then apply operations after that like >> traditional >> > batch >> > SQL. >> > We hope it can help to reduce the learning curve of windows, >> improve >> > NRT >> > for Flink, and attract more batch users. >> > >> > A simple code snippet for 10 minutes tumbling window aggregate: >> > >> > SELECT window_start, window_end, SUM(price) >> > FROM TABLE( >> > TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' >> MINUTES)) >> > GROUP BY window_start, window_end; >> > >> > I'm looking forward to your feedback. >> > >> > Best, >> > Jark >> > >> > >> > >> >> >> |
Hi all,
I know we have a lot of discussion and development on going right now but it would be great if we can get FLIP-145 into a votable state. If there are no objections, I would like to start voting in the next days. Best, Jark On Thu, 1 Oct 2020 at 14:29, Jark Wu <[hidden email]> wrote: > Hi everyone, > > I have added a section for Performance Optimization to describe how to > improve the performance in the short-term and long-term > and sketch the future performance potential under the new window API. > Introducing the window API is just the first step, we will > continuously improve the performance to make it powerful and useful. > > Best, > Jark > > On Thu, 1 Oct 2020 at 14:28, Jark Wu <[hidden email]> wrote: > >> Hi Pengcheng, >> >> Yes, the window TVF is part of the FLIP. Welcome to contribute and join >> the discussion. >> Regarding the SESSION window aggregation, users can use the existing >> grouped session window function. >> >> Best, >> Jark >> >> On Sun, 27 Sep 2020 at 21:24, liupengcheng <[hidden email]> >> wrote: >> >>> Hi Jark, >>> Thanks for reply, yes, I think it's a good feature, it can >>> improve the NRT scenarios >>> as you mentioned in the FLIP. Also, I think it can improve the >>> streaming SQL greatly, >>> it can support richer window operations in flink SQL and bring >>> great convenience to users. >>> (we are now only supported group window in flink). >>> >>> Regarding the SESSION window, I think it's especially useful for >>> user behavior analysis(e.g. >>> counting user visits on a news website or social platform), but >>> I agree that we can keep it >>> out of the FLIP now to catch up 1.12. >>> >>> Recently, I've done some work on the stream planner with the >>> TVFs, and I'm willing to contribute >>> to this part. Is it in the plan of this FLIP? >>> >>> Best, >>> PengchengLiu >>> >>> >>> 在 2020/9/26 下午11:09,“Jark Wu”<[hidden email]> 写入: >>> >>> Hi pengcheng, >>> >>> That's great to see you also have the need of window join. >>> You are right, the windowing TVF is a powerful feature which can >>> support >>> more operations in the future. >>> I think it as of the date time "partition" selection in batch SQL >>> jobs, >>> with this new syntax, I think it is possible >>> to migrate traditional batch SQL jobs to Flink SQL by changing a >>> few lines. >>> >>> Regarding the SESSION window, this is on purpose to keep it out of >>> the >>> FLIP, because we want to keep the >>> FLIP small to catch up 1.12 and SESSION TVF is rarely useful (e.g. >>> session >>> window join?). >>> >>> Best, >>> Jark >>> >>> On Fri, 25 Sep 2020 at 22:59, liupengcheng < >>> [hidden email]> >>> wrote: >>> >>> > Hi, Jark, >>> > I'm very interested in this feature, and I'm also working >>> on this >>> > recently. >>> > I just have a glance at the FLIP, it's good, but I found >>> that >>> > there is no plan to add SESSION windows. >>> > Also, I think there can be more things we can do based on >>> this new >>> > syntax. For example, >>> > - window sort support >>> > - window union/intersect/minus support >>> > - Improve dimension table join >>> > We can have more deep discussion on this new feature later >>> . >>> > I've also opened an jira that is related to this feature >>> recently: >>> > https://issues.apache.org/jira/browse/FLINK-18830 >>> > >>> > Best! >>> > PengchengLiu >>> > >>> > 在 2020/9/25 下午10:30,“Jark Wu”<[hidden email]> 写入: >>> > >>> > Hi everyone, >>> > >>> > I want to start a FLIP about supporting windowing table-valued >>> > functions >>> > (TVF). >>> > The main purpose of this FLIP is to improve the near real-time >>> (NRT) >>> > experience of Flink. >>> > >>> > FLIP-145: >>> > >>> > >>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-145%3A+Support+SQL+windowing+table-valued+function >>> > >>> > We want to introduce TUMBLE, HOP, CUMULATE windowing TVFs, the >>> > CUMULATE is >>> > a new kind of window. >>> > With the windowing TVFs, we can support richer operations on >>> windows, >>> > including window join, window TopN and so on. >>> > This makes things simple: we only need to assign windows at the >>> > beginning >>> > of the query, and then apply operations after that like >>> traditional >>> > batch >>> > SQL. >>> > We hope it can help to reduce the learning curve of windows, >>> improve >>> > NRT >>> > for Flink, and attract more batch users. >>> > >>> > A simple code snippet for 10 minutes tumbling window aggregate: >>> > >>> > SELECT window_start, window_end, SUM(price) >>> > FROM TABLE( >>> > TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' >>> MINUTES)) >>> > GROUP BY window_start, window_end; >>> > >>> > I'm looking forward to your feedback. >>> > >>> > Best, >>> > Jark >>> > >>> > >>> > >>> >>> >>> |
Thanks Jark for bringing this discussion, I like this FLIP very much.
Especially the cumulate window, it's much like the current TUMBLE window + Fast Emit (which is an undocumented experimental feature), however, it's more powerful. And This will make the SQL semantic more standard, especially for the HOPPING window. Regarding time attribute, It seems that we don't need a specific function to infer the time attribute like `TUMBLE_ROWTIME` / `TUMBLE_PROCTIME`. Then are `window_start` and `window_end` column a time attribute column automatically? - If not, what will be the time attribute of the result relation of these TVFs? Especially after the window aggregation. - If yes, then how do we handle proctime? Regarding batch operators, It's great to hear that we can reuse the batch operators in continuous batch mode as you mentioned in the FLIP. Current window aggregate could also be used in batch mode with rowtime. Do you plan to support these TVFs for batch mode in this FLIP? Hence the Table/SQL is a unified API, it's great if we can keep the features complete both in streaming and batch mode. There is one more question, I don't know whether it should be considered in this FLIP. Does the new window support `grouping sets`? (It's not supported in old window impl). Jark Wu <[hidden email]> 于2020年10月9日周五 下午4:14写道: > Hi all, > > I know we have a lot of discussion and development on going right now but > it would be great if we can get FLIP-145 into a votable state. > If there are no objections, I would like to start voting in the next days. > > Best, > Jark > > On Thu, 1 Oct 2020 at 14:29, Jark Wu <[hidden email]> wrote: > > > Hi everyone, > > > > I have added a section for Performance Optimization to describe how to > > improve the performance in the short-term and long-term > > and sketch the future performance potential under the new window API. > > Introducing the window API is just the first step, we will > > continuously improve the performance to make it powerful and useful. > > > > Best, > > Jark > > > > On Thu, 1 Oct 2020 at 14:28, Jark Wu <[hidden email]> wrote: > > > >> Hi Pengcheng, > >> > >> Yes, the window TVF is part of the FLIP. Welcome to contribute and join > >> the discussion. > >> Regarding the SESSION window aggregation, users can use the existing > >> grouped session window function. > >> > >> Best, > >> Jark > >> > >> On Sun, 27 Sep 2020 at 21:24, liupengcheng <[hidden email] > > > >> wrote: > >> > >>> Hi Jark, > >>> Thanks for reply, yes, I think it's a good feature, it can > >>> improve the NRT scenarios > >>> as you mentioned in the FLIP. Also, I think it can improve the > >>> streaming SQL greatly, > >>> it can support richer window operations in flink SQL and bring > >>> great convenience to users. > >>> (we are now only supported group window in flink). > >>> > >>> Regarding the SESSION window, I think it's especially useful > for > >>> user behavior analysis(e.g. > >>> counting user visits on a news website or social platform), but > >>> I agree that we can keep it > >>> out of the FLIP now to catch up 1.12. > >>> > >>> Recently, I've done some work on the stream planner with the > >>> TVFs, and I'm willing to contribute > >>> to this part. Is it in the plan of this FLIP? > >>> > >>> Best, > >>> PengchengLiu > >>> > >>> > >>> 在 2020/9/26 下午11:09,“Jark Wu”<[hidden email]> 写入: > >>> > >>> Hi pengcheng, > >>> > >>> That's great to see you also have the need of window join. > >>> You are right, the windowing TVF is a powerful feature which can > >>> support > >>> more operations in the future. > >>> I think it as of the date time "partition" selection in batch SQL > >>> jobs, > >>> with this new syntax, I think it is possible > >>> to migrate traditional batch SQL jobs to Flink SQL by changing a > >>> few lines. > >>> > >>> Regarding the SESSION window, this is on purpose to keep it out of > >>> the > >>> FLIP, because we want to keep the > >>> FLIP small to catch up 1.12 and SESSION TVF is rarely useful (e.g. > >>> session > >>> window join?). > >>> > >>> Best, > >>> Jark > >>> > >>> On Fri, 25 Sep 2020 at 22:59, liupengcheng < > >>> [hidden email]> > >>> wrote: > >>> > >>> > Hi, Jark, > >>> > I'm very interested in this feature, and I'm also working > >>> on this > >>> > recently. > >>> > I just have a glance at the FLIP, it's good, but I found > >>> that > >>> > there is no plan to add SESSION windows. > >>> > Also, I think there can be more things we can do based on > >>> this new > >>> > syntax. For example, > >>> > - window sort support > >>> > - window union/intersect/minus support > >>> > - Improve dimension table join > >>> > We can have more deep discussion on this new feature > later > >>> . > >>> > I've also opened an jira that is related to this feature > >>> recently: > >>> > https://issues.apache.org/jira/browse/FLINK-18830 > >>> > > >>> > Best! > >>> > PengchengLiu > >>> > > >>> > 在 2020/9/25 下午10:30,“Jark Wu”<[hidden email]> 写入: > >>> > > >>> > Hi everyone, > >>> > > >>> > I want to start a FLIP about supporting windowing > table-valued > >>> > functions > >>> > (TVF). > >>> > The main purpose of this FLIP is to improve the near > real-time > >>> (NRT) > >>> > experience of Flink. > >>> > > >>> > FLIP-145: > >>> > > >>> > > >>> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-145%3A+Support+SQL+windowing+table-valued+function > >>> > > >>> > We want to introduce TUMBLE, HOP, CUMULATE windowing TVFs, > the > >>> > CUMULATE is > >>> > a new kind of window. > >>> > With the windowing TVFs, we can support richer operations on > >>> windows, > >>> > including window join, window TopN and so on. > >>> > This makes things simple: we only need to assign windows at > the > >>> > beginning > >>> > of the query, and then apply operations after that like > >>> traditional > >>> > batch > >>> > SQL. > >>> > We hope it can help to reduce the learning curve of windows, > >>> improve > >>> > NRT > >>> > for Flink, and attract more batch users. > >>> > > >>> > A simple code snippet for 10 minutes tumbling window > aggregate: > >>> > > >>> > SELECT window_start, window_end, SUM(price) > >>> > FROM TABLE( > >>> > TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' > >>> MINUTES)) > >>> > GROUP BY window_start, window_end; > >>> > > >>> > I'm looking forward to your feedback. > >>> > > >>> > Best, > >>> > Jark > >>> > > >>> > > >>> > > >>> > >>> > >>> > -- Best, Benchao Li |
Hi,Benchao,
Welcome to join the discussion, yes, this new syntax can make SQL more clear and simpler. For your first question, the `window_start` and `window_end` columns will be added automatically, so we don't need to use auxiliary group functions to infer or access the window properties. For the `grouping sets` on TVFs, I think it's interesting if we can support it, as we already supported `grouping sets` on streaming aggregates in blink planner. But I'm not sure if it will be included into this FLIP. cc @Jark Wu Best, Pengcheng 在 2020/10/9 下午5:25,“Benchao Li”<[hidden email]> 写入: Thanks Jark for bringing this discussion, I like this FLIP very much. Especially the cumulate window, it's much like the current TUMBLE window + Fast Emit (which is an undocumented experimental feature), however, it's more powerful. And This will make the SQL semantic more standard, especially for the HOPPING window. Regarding time attribute, It seems that we don't need a specific function to infer the time attribute like `TUMBLE_ROWTIME` / `TUMBLE_PROCTIME`. Then are `window_start` and `window_end` column a time attribute column automatically? - If not, what will be the time attribute of the result relation of these TVFs? Especially after the window aggregation. - If yes, then how do we handle proctime? Regarding batch operators, It's great to hear that we can reuse the batch operators in continuous batch mode as you mentioned in the FLIP. Current window aggregate could also be used in batch mode with rowtime. Do you plan to support these TVFs for batch mode in this FLIP? Hence the Table/SQL is a unified API, it's great if we can keep the features complete both in streaming and batch mode. There is one more question, I don't know whether it should be considered in this FLIP. Does the new window support `grouping sets`? (It's not supported in old window impl). Jark Wu <[hidden email]> 于2020年10月9日周五 下午4:14写道: > Hi all, > > I know we have a lot of discussion and development on going right now but > it would be great if we can get FLIP-145 into a votable state. > If there are no objections, I would like to start voting in the next days. > > Best, > Jark > > On Thu, 1 Oct 2020 at 14:29, Jark Wu <[hidden email]> wrote: > > > Hi everyone, > > > > I have added a section for Performance Optimization to describe how to > > improve the performance in the short-term and long-term > > and sketch the future performance potential under the new window API. > > Introducing the window API is just the first step, we will > > continuously improve the performance to make it powerful and useful. > > > > Best, > > Jark > > > > On Thu, 1 Oct 2020 at 14:28, Jark Wu <[hidden email]> wrote: > > > >> Hi Pengcheng, > >> > >> Yes, the window TVF is part of the FLIP. Welcome to contribute and join > >> the discussion. > >> Regarding the SESSION window aggregation, users can use the existing > >> grouped session window function. > >> > >> Best, > >> Jark > >> > >> On Sun, 27 Sep 2020 at 21:24, liupengcheng <[hidden email] > > > >> wrote: > >> > >>> Hi Jark, > >>> Thanks for reply, yes, I think it's a good feature, it can > >>> improve the NRT scenarios > >>> as you mentioned in the FLIP. Also, I think it can improve the > >>> streaming SQL greatly, > >>> it can support richer window operations in flink SQL and bring > >>> great convenience to users. > >>> (we are now only supported group window in flink). > >>> > >>> Regarding the SESSION window, I think it's especially useful > for > >>> user behavior analysis(e.g. > >>> counting user visits on a news website or social platform), but > >>> I agree that we can keep it > >>> out of the FLIP now to catch up 1.12. > >>> > >>> Recently, I've done some work on the stream planner with the > >>> TVFs, and I'm willing to contribute > >>> to this part. Is it in the plan of this FLIP? > >>> > >>> Best, > >>> PengchengLiu > >>> > >>> > >>> 在 2020/9/26 下午11:09,“Jark Wu”<[hidden email]> 写入: > >>> > >>> Hi pengcheng, > >>> > >>> That's great to see you also have the need of window join. > >>> You are right, the windowing TVF is a powerful feature which can > >>> support > >>> more operations in the future. > >>> I think it as of the date time "partition" selection in batch SQL > >>> jobs, > >>> with this new syntax, I think it is possible > >>> to migrate traditional batch SQL jobs to Flink SQL by changing a > >>> few lines. > >>> > >>> Regarding the SESSION window, this is on purpose to keep it out of > >>> the > >>> FLIP, because we want to keep the > >>> FLIP small to catch up 1.12 and SESSION TVF is rarely useful (e.g. > >>> session > >>> window join?). > >>> > >>> Best, > >>> Jark > >>> > >>> On Fri, 25 Sep 2020 at 22:59, liupengcheng < > >>> [hidden email]> > >>> wrote: > >>> > >>> > Hi, Jark, > >>> > I'm very interested in this feature, and I'm also working > >>> on this > >>> > recently. > >>> > I just have a glance at the FLIP, it's good, but I found > >>> that > >>> > there is no plan to add SESSION windows. > >>> > Also, I think there can be more things we can do based on > >>> this new > >>> > syntax. For example, > >>> > - window sort support > >>> > - window union/intersect/minus support > >>> > - Improve dimension table join > >>> > We can have more deep discussion on this new feature > later > >>> . > >>> > I've also opened an jira that is related to this feature > >>> recently: > >>> > https://issues.apache.org/jira/browse/FLINK-18830 > >>> > > >>> > Best! > >>> > PengchengLiu > >>> > > >>> > 在 2020/9/25 下午10:30,“Jark Wu”<[hidden email]> 写入: > >>> > > >>> > Hi everyone, > >>> > > >>> > I want to start a FLIP about supporting windowing > table-valued > >>> > functions > >>> > (TVF). > >>> > The main purpose of this FLIP is to improve the near > real-time > >>> (NRT) > >>> > experience of Flink. > >>> > > >>> > FLIP-145: > >>> > > >>> > > >>> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-145%3A+Support+SQL+windowing+table-valued+function > >>> > > >>> > We want to introduce TUMBLE, HOP, CUMULATE windowing TVFs, > the > >>> > CUMULATE is > >>> > a new kind of window. > >>> > With the windowing TVFs, we can support richer operations on > >>> windows, > >>> > including window join, window TopN and so on. > >>> > This makes things simple: we only need to assign windows at > the > >>> > beginning > >>> > of the query, and then apply operations after that like > >>> traditional > >>> > batch > >>> > SQL. > >>> > We hope it can help to reduce the learning curve of windows, > >>> improve > >>> > NRT > >>> > for Flink, and attract more batch users. > >>> > > >>> > A simple code snippet for 10 minutes tumbling window > aggregate: > >>> > > >>> > SELECT window_start, window_end, SUM(price) > >>> > FROM TABLE( > >>> > TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' > >>> MINUTES)) > >>> > GROUP BY window_start, window_end; > >>> > > >>> > I'm looking forward to your feedback. > >>> > > >>> > Best, > >>> > Jark > >>> > > >>> > > >>> > > >>> > >>> > >>> > -- Best, Benchao Li |
Hi Benchao,
1) time attribute Yes. We don't need time attribute auxiliary function. Because the new window operations are all based on the window_start and window_end columns instead of on the time attributes. So we don't need to propagate time attributes. Cascaded window aggregate can be expressed by simply GROUP BY the window_start and window_end of the previous window result. I have added a cascaded window aggregate example in the Tumbling Window section in the FLIP. If you want to define proctime window aggregate, the time column in TVF should be a proctime attribute field (or PROCTIME() function). 2) batch support Yes. The proposed syntax/API are unified for batch and streaming. Batch support is in the plan, but may not have enough time to catch up 1.12. 3) support `grouping sets` This is not included in the FLIP, but I think it's great if we can support `grouping sets`. The existing window impl doesn't support this because we convert the LogicalAggregate into WindowAggregate in the beginning, the expand grouping sets rule can't be applied in this situation. Fortunately, with the new window impl, the conversion to WindowAggregate will happen at the end, so I think the expand rule can be applied and support this feature naturally. Therefore, IMO, we don't need to include this feature in this FLIP to avoid the FLIP being too large. This can be a follow-up issue (maybe just add tests and docs) after the FLIP. Best, Jark On Fri, 9 Oct 2020 at 19:09, 刘 芃成 <[hidden email]> wrote: > Hi,Benchao, > Welcome to join the discussion, yes, this new syntax can make SQL > more clear and simpler. > For your first question, the `window_start` and `window_end` > columns will be added automatically, > so we don't need to use auxiliary group functions to infer or > access the window properties. > > For the `grouping sets` on TVFs, I think it's interesting if we > can support it, as we already supported `grouping sets` > on streaming aggregates in blink planner. But I'm not sure if it > will be included into this FLIP. > > cc @Jark Wu > > Best, > Pengcheng > > > 在 2020/10/9 下午5:25,“Benchao Li”<[hidden email]> 写入: > > Thanks Jark for bringing this discussion, I like this FLIP very much. > > Especially the cumulate window, it's much like the current TUMBLE > window + > Fast Emit (which is an undocumented experimental feature), however, > it's > more powerful. > > And This will make the SQL semantic more standard, especially for the > HOPPING window. > > Regarding time attribute, > It seems that we don't need a specific function to infer the time > attribute > like > `TUMBLE_ROWTIME` / `TUMBLE_PROCTIME`. Then are `window_start` and > `window_end` > column a time attribute column automatically? > - If not, what will be the time attribute of the result relation of > these > TVFs? > Especially after the window aggregation. > - If yes, then how do we handle proctime? > > Regarding batch operators, > It's great to hear that we can reuse the batch operators in continuous > batch mode > as you mentioned in the FLIP. > Current window aggregate could also be used in batch mode with > rowtime. Do > you plan > to support these TVFs for batch mode in this FLIP? Hence the Table/SQL > is a > unified > API, it's great if we can keep the features complete both in streaming > and > batch mode. > > There is one more question, I don't know whether it should be > considered in > this FLIP. > Does the new window support `grouping sets`? (It's not supported in old > window impl). > > Jark Wu <[hidden email]> 于2020年10月9日周五 下午4:14写道: > > > Hi all, > > > > I know we have a lot of discussion and development on going right > now but > > it would be great if we can get FLIP-145 into a votable state. > > If there are no objections, I would like to start voting in the next > days. > > > > Best, > > Jark > > > > On Thu, 1 Oct 2020 at 14:29, Jark Wu <[hidden email]> wrote: > > > > > Hi everyone, > > > > > > I have added a section for Performance Optimization to describe > how to > > > improve the performance in the short-term and long-term > > > and sketch the future performance potential under the new window > API. > > > Introducing the window API is just the first step, we will > > > continuously improve the performance to make it powerful and > useful. > > > > > > Best, > > > Jark > > > > > > On Thu, 1 Oct 2020 at 14:28, Jark Wu <[hidden email]> wrote: > > > > > >> Hi Pengcheng, > > >> > > >> Yes, the window TVF is part of the FLIP. Welcome to contribute > and join > > >> the discussion. > > >> Regarding the SESSION window aggregation, users can use the > existing > > >> grouped session window function. > > >> > > >> Best, > > >> Jark > > >> > > >> On Sun, 27 Sep 2020 at 21:24, liupengcheng < > [hidden email] > > > > > >> wrote: > > >> > > >>> Hi Jark, > > >>> Thanks for reply, yes, I think it's a good feature, it > can > > >>> improve the NRT scenarios > > >>> as you mentioned in the FLIP. Also, I think it can > improve the > > >>> streaming SQL greatly, > > >>> it can support richer window operations in flink SQL and > bring > > >>> great convenience to users. > > >>> (we are now only supported group window in flink). > > >>> > > >>> Regarding the SESSION window, I think it's especially > useful > > for > > >>> user behavior analysis(e.g. > > >>> counting user visits on a news website or social > platform), but > > >>> I agree that we can keep it > > >>> out of the FLIP now to catch up 1.12. > > >>> > > >>> Recently, I've done some work on the stream planner with > the > > >>> TVFs, and I'm willing to contribute > > >>> to this part. Is it in the plan of this FLIP? > > >>> > > >>> Best, > > >>> PengchengLiu > > >>> > > >>> > > >>> 在 2020/9/26 下午11:09,“Jark Wu”<[hidden email]> 写入: > > >>> > > >>> Hi pengcheng, > > >>> > > >>> That's great to see you also have the need of window join. > > >>> You are right, the windowing TVF is a powerful feature which > can > > >>> support > > >>> more operations in the future. > > >>> I think it as of the date time "partition" selection in > batch SQL > > >>> jobs, > > >>> with this new syntax, I think it is possible > > >>> to migrate traditional batch SQL jobs to Flink SQL by > changing a > > >>> few lines. > > >>> > > >>> Regarding the SESSION window, this is on purpose to keep it > out of > > >>> the > > >>> FLIP, because we want to keep the > > >>> FLIP small to catch up 1.12 and SESSION TVF is rarely useful > (e.g. > > >>> session > > >>> window join?). > > >>> > > >>> Best, > > >>> Jark > > >>> > > >>> On Fri, 25 Sep 2020 at 22:59, liupengcheng < > > >>> [hidden email]> > > >>> wrote: > > >>> > > >>> > Hi, Jark, > > >>> > I'm very interested in this feature, and I'm also > working > > >>> on this > > >>> > recently. > > >>> > I just have a glance at the FLIP, it's good, but I > found > > >>> that > > >>> > there is no plan to add SESSION windows. > > >>> > Also, I think there can be more things we can do > based on > > >>> this new > > >>> > syntax. For example, > > >>> > - window sort support > > >>> > - window union/intersect/minus support > > >>> > - Improve dimension table join > > >>> > We can have more deep discussion on this new > feature > > later > > >>> . > > >>> > I've also opened an jira that is related to this > feature > > >>> recently: > > >>> > https://issues.apache.org/jira/browse/FLINK-18830 > > >>> > > > >>> > Best! > > >>> > PengchengLiu > > >>> > > > >>> > 在 2020/9/25 下午10:30,“Jark Wu”<[hidden email]> 写入: > > >>> > > > >>> > Hi everyone, > > >>> > > > >>> > I want to start a FLIP about supporting windowing > > table-valued > > >>> > functions > > >>> > (TVF). > > >>> > The main purpose of this FLIP is to improve the near > > real-time > > >>> (NRT) > > >>> > experience of Flink. > > >>> > > > >>> > FLIP-145: > > >>> > > > >>> > > > >>> > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-145%3A+Support+SQL+windowing+table-valued+function > > >>> > > > >>> > We want to introduce TUMBLE, HOP, CUMULATE windowing > TVFs, > > the > > >>> > CUMULATE is > > >>> > a new kind of window. > > >>> > With the windowing TVFs, we can support richer > operations on > > >>> windows, > > >>> > including window join, window TopN and so on. > > >>> > This makes things simple: we only need to assign > windows at > > the > > >>> > beginning > > >>> > of the query, and then apply operations after that like > > >>> traditional > > >>> > batch > > >>> > SQL. > > >>> > We hope it can help to reduce the learning curve of > windows, > > >>> improve > > >>> > NRT > > >>> > for Flink, and attract more batch users. > > >>> > > > >>> > A simple code snippet for 10 minutes tumbling window > > aggregate: > > >>> > > > >>> > SELECT window_start, window_end, SUM(price) > > >>> > FROM TABLE( > > >>> > TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL > '10' > > >>> MINUTES)) > > >>> > GROUP BY window_start, window_end; > > >>> > > > >>> > I'm looking forward to your feedback. > > >>> > > > >>> > Best, > > >>> > Jark > > >>> > > > >>> > > > >>> > > > >>> > > >>> > > >>> > > > > > -- > > Best, > Benchao Li > |
Hi Jark,
2 & 3 sounds good to me. Regarding time attribute, I still have some questions, I knew it's easy to support cascaded window aggregate using new TVFs. However there are some other places where need time attribute: - CEP - interval join - order by - over window If there is no time attribute column, how do we integrate these old features with the new TVFs. E.g. StreamA -> new window aggregate -> interval join -> Sink / StreamB ----------------------------------- Jark Wu <[hidden email]> 于2020年10月9日周五 下午11:51写道: > Hi Benchao, > > 1) time attribute > Yes. We don't need time attribute auxiliary function. Because the new > window operations are all based on the > window_start and window_end columns instead of on the time attributes. So > we don't need to propagate time attributes. > Cascaded window aggregate can be expressed by simply GROUP BY the > window_start and window_end of the previous window result. > I have added a cascaded window aggregate example in the Tumbling Window > section in the FLIP. > If you want to define proctime window aggregate, the time column in TVF > should be a proctime attribute field (or PROCTIME() function). > > 2) batch support > Yes. The proposed syntax/API are unified for batch and streaming. Batch > support is in the plan, but may not have enough time to catch up 1.12. > > 3) support `grouping sets` > This is not included in the FLIP, but I think it's great if we can support > `grouping sets`. > The existing window impl doesn't support this because we convert the > LogicalAggregate into WindowAggregate in the beginning, > the expand grouping sets rule can't be applied in this situation. > Fortunately, with the new window impl, the conversion to WindowAggregate > will happen at the end, so I think the expand rule can be > applied and support this feature naturally. > Therefore, IMO, we don't need to include this feature in this FLIP to avoid > the FLIP being too large. > This can be a follow-up issue (maybe just add tests and docs) after the > FLIP. > > Best, > Jark > > > On Fri, 9 Oct 2020 at 19:09, 刘 芃成 <[hidden email]> wrote: > > > Hi,Benchao, > > Welcome to join the discussion, yes, this new syntax can make SQL > > more clear and simpler. > > For your first question, the `window_start` and `window_end` > > columns will be added automatically, > > so we don't need to use auxiliary group functions to infer or > > access the window properties. > > > > For the `grouping sets` on TVFs, I think it's interesting if we > > can support it, as we already supported `grouping sets` > > on streaming aggregates in blink planner. But I'm not sure if it > > will be included into this FLIP. > > > > cc @Jark Wu > > > > Best, > > Pengcheng > > > > > > 在 2020/10/9 下午5:25,“Benchao Li”<[hidden email]> 写入: > > > > Thanks Jark for bringing this discussion, I like this FLIP very much. > > > > Especially the cumulate window, it's much like the current TUMBLE > > window + > > Fast Emit (which is an undocumented experimental feature), however, > > it's > > more powerful. > > > > And This will make the SQL semantic more standard, especially for the > > HOPPING window. > > > > Regarding time attribute, > > It seems that we don't need a specific function to infer the time > > attribute > > like > > `TUMBLE_ROWTIME` / `TUMBLE_PROCTIME`. Then are `window_start` and > > `window_end` > > column a time attribute column automatically? > > - If not, what will be the time attribute of the result relation of > > these > > TVFs? > > Especially after the window aggregation. > > - If yes, then how do we handle proctime? > > > > Regarding batch operators, > > It's great to hear that we can reuse the batch operators in > continuous > > batch mode > > as you mentioned in the FLIP. > > Current window aggregate could also be used in batch mode with > > rowtime. Do > > you plan > > to support these TVFs for batch mode in this FLIP? Hence the > Table/SQL > > is a > > unified > > API, it's great if we can keep the features complete both in > streaming > > and > > batch mode. > > > > There is one more question, I don't know whether it should be > > considered in > > this FLIP. > > Does the new window support `grouping sets`? (It's not supported in > old > > window impl). > > > > Jark Wu <[hidden email]> 于2020年10月9日周五 下午4:14写道: > > > > > Hi all, > > > > > > I know we have a lot of discussion and development on going right > > now but > > > it would be great if we can get FLIP-145 into a votable state. > > > If there are no objections, I would like to start voting in the > next > > days. > > > > > > Best, > > > Jark > > > > > > On Thu, 1 Oct 2020 at 14:29, Jark Wu <[hidden email]> wrote: > > > > > > > Hi everyone, > > > > > > > > I have added a section for Performance Optimization to describe > > how to > > > > improve the performance in the short-term and long-term > > > > and sketch the future performance potential under the new window > > API. > > > > Introducing the window API is just the first step, we will > > > > continuously improve the performance to make it powerful and > > useful. > > > > > > > > Best, > > > > Jark > > > > > > > > On Thu, 1 Oct 2020 at 14:28, Jark Wu <[hidden email]> wrote: > > > > > > > >> Hi Pengcheng, > > > >> > > > >> Yes, the window TVF is part of the FLIP. Welcome to contribute > > and join > > > >> the discussion. > > > >> Regarding the SESSION window aggregation, users can use the > > existing > > > >> grouped session window function. > > > >> > > > >> Best, > > > >> Jark > > > >> > > > >> On Sun, 27 Sep 2020 at 21:24, liupengcheng < > > [hidden email] > > > > > > > >> wrote: > > > >> > > > >>> Hi Jark, > > > >>> Thanks for reply, yes, I think it's a good feature, it > > can > > > >>> improve the NRT scenarios > > > >>> as you mentioned in the FLIP. Also, I think it can > > improve the > > > >>> streaming SQL greatly, > > > >>> it can support richer window operations in flink SQL > and > > bring > > > >>> great convenience to users. > > > >>> (we are now only supported group window in flink). > > > >>> > > > >>> Regarding the SESSION window, I think it's especially > > useful > > > for > > > >>> user behavior analysis(e.g. > > > >>> counting user visits on a news website or social > > platform), but > > > >>> I agree that we can keep it > > > >>> out of the FLIP now to catch up 1.12. > > > >>> > > > >>> Recently, I've done some work on the stream planner > with > > the > > > >>> TVFs, and I'm willing to contribute > > > >>> to this part. Is it in the plan of this FLIP? > > > >>> > > > >>> Best, > > > >>> PengchengLiu > > > >>> > > > >>> > > > >>> 在 2020/9/26 下午11:09,“Jark Wu”<[hidden email]> 写入: > > > >>> > > > >>> Hi pengcheng, > > > >>> > > > >>> That's great to see you also have the need of window join. > > > >>> You are right, the windowing TVF is a powerful feature > which > > can > > > >>> support > > > >>> more operations in the future. > > > >>> I think it as of the date time "partition" selection in > > batch SQL > > > >>> jobs, > > > >>> with this new syntax, I think it is possible > > > >>> to migrate traditional batch SQL jobs to Flink SQL by > > changing a > > > >>> few lines. > > > >>> > > > >>> Regarding the SESSION window, this is on purpose to keep it > > out of > > > >>> the > > > >>> FLIP, because we want to keep the > > > >>> FLIP small to catch up 1.12 and SESSION TVF is rarely > useful > > (e.g. > > > >>> session > > > >>> window join?). > > > >>> > > > >>> Best, > > > >>> Jark > > > >>> > > > >>> On Fri, 25 Sep 2020 at 22:59, liupengcheng < > > > >>> [hidden email]> > > > >>> wrote: > > > >>> > > > >>> > Hi, Jark, > > > >>> > I'm very interested in this feature, and I'm also > > working > > > >>> on this > > > >>> > recently. > > > >>> > I just have a glance at the FLIP, it's good, but > I > > found > > > >>> that > > > >>> > there is no plan to add SESSION windows. > > > >>> > Also, I think there can be more things we can do > > based on > > > >>> this new > > > >>> > syntax. For example, > > > >>> > - window sort support > > > >>> > - window union/intersect/minus support > > > >>> > - Improve dimension table join > > > >>> > We can have more deep discussion on this new > > feature > > > later > > > >>> . > > > >>> > I've also opened an jira that is related to this > > feature > > > >>> recently: > > > >>> > https://issues.apache.org/jira/browse/FLINK-18830 > > > >>> > > > > >>> > Best! > > > >>> > PengchengLiu > > > >>> > > > > >>> > 在 2020/9/25 下午10:30,“Jark Wu”<[hidden email]> 写入: > > > >>> > > > > >>> > Hi everyone, > > > >>> > > > > >>> > I want to start a FLIP about supporting windowing > > > table-valued > > > >>> > functions > > > >>> > (TVF). > > > >>> > The main purpose of this FLIP is to improve the near > > > real-time > > > >>> (NRT) > > > >>> > experience of Flink. > > > >>> > > > > >>> > FLIP-145: > > > >>> > > > > >>> > > > > >>> > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-145%3A+Support+SQL+windowing+table-valued+function > > > >>> > > > > >>> > We want to introduce TUMBLE, HOP, CUMULATE windowing > > TVFs, > > > the > > > >>> > CUMULATE is > > > >>> > a new kind of window. > > > >>> > With the windowing TVFs, we can support richer > > operations on > > > >>> windows, > > > >>> > including window join, window TopN and so on. > > > >>> > This makes things simple: we only need to assign > > windows at > > > the > > > >>> > beginning > > > >>> > of the query, and then apply operations after that > like > > > >>> traditional > > > >>> > batch > > > >>> > SQL. > > > >>> > We hope it can help to reduce the learning curve of > > windows, > > > >>> improve > > > >>> > NRT > > > >>> > for Flink, and attract more batch users. > > > >>> > > > > >>> > A simple code snippet for 10 minutes tumbling window > > > aggregate: > > > >>> > > > > >>> > SELECT window_start, window_end, SUM(price) > > > >>> > FROM TABLE( > > > >>> > TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL > > '10' > > > >>> MINUTES)) > > > >>> > GROUP BY window_start, window_end; > > > >>> > > > > >>> > I'm looking forward to your feedback. > > > >>> > > > > >>> > Best, > > > >>> > Jark > > > >>> > > > > >>> > > > > >>> > > > > >>> > > > >>> > > > >>> > > > > > > > > > -- > > > > Best, > > Benchao Li > > > -- Best, Benchao Li |
Hi,Benchao,
In TVFs, the time attributes is just passed through from parent rels, and the TVFs just add two additional window attributes(i.e. window_start & window_end). Also, I think the time columns can be not only a time attribute with type of `TimeIndicatorType` but also a regular column with type of `Timestamp`. For cascaded window operations, we can use window_start/window_end of the previous window result directly to indicate operating on the same window, or use new DESCRIPTOR column to assign new windows, in case of the change of the time column(e.g. in some case, the original timestamp is inaccurate and need some conversion to be used). You can check the definition or signature of these TVFs in the FLIP. e.g. > SELECT * FROM TABLE( > TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES)) In the example, the `bidtime` is the time attribute column, which is the first operand of the DESCRIPTOR function. +1 start voting. Benchao Li <[hidden email]> 于2020年10月10日周六 上午10:08写道: > Hi Jark, > > 2 & 3 sounds good to me. > > Regarding time attribute, > I still have some questions, I knew it's easy to support cascaded window > aggregate using new TVFs. > However there are some other places where need time attribute: > - CEP > - interval join > - order by > - over window > If there is no time attribute column, how do we integrate these old > features with the new TVFs. > E.g. > StreamA -> new window aggregate -> interval join -> Sink > / > StreamB ----------------------------------- > > > Jark Wu <[hidden email]> 于2020年10月9日周五 下午11:51写道: > >> Hi Benchao, >> >> 1) time attribute >> Yes. We don't need time attribute auxiliary function. Because the new >> window operations are all based on the >> window_start and window_end columns instead of on the time attributes. So >> we don't need to propagate time attributes. >> Cascaded window aggregate can be expressed by simply GROUP BY the >> window_start and window_end of the previous window result. >> I have added a cascaded window aggregate example in the Tumbling Window >> section in the FLIP. >> If you want to define proctime window aggregate, the time column in TVF >> should be a proctime attribute field (or PROCTIME() function). >> >> 2) batch support >> Yes. The proposed syntax/API are unified for batch and streaming. Batch >> support is in the plan, but may not have enough time to catch up 1.12. >> >> 3) support `grouping sets` >> This is not included in the FLIP, but I think it's great if we can support >> `grouping sets`. >> The existing window impl doesn't support this because we convert the >> LogicalAggregate into WindowAggregate in the beginning, >> the expand grouping sets rule can't be applied in this situation. >> Fortunately, with the new window impl, the conversion to WindowAggregate >> will happen at the end, so I think the expand rule can be >> applied and support this feature naturally. >> Therefore, IMO, we don't need to include this feature in this FLIP to >> avoid >> the FLIP being too large. >> This can be a follow-up issue (maybe just add tests and docs) after the >> FLIP. >> >> Best, >> Jark >> >> >> On Fri, 9 Oct 2020 at 19:09, 刘 芃成 <[hidden email]> wrote: >> >> > Hi,Benchao, >> > Welcome to join the discussion, yes, this new syntax can make >> SQL >> > more clear and simpler. >> > For your first question, the `window_start` and `window_end` >> > columns will be added automatically, >> > so we don't need to use auxiliary group functions to infer or >> > access the window properties. >> > >> > For the `grouping sets` on TVFs, I think it's interesting if we >> > can support it, as we already supported `grouping sets` >> > on streaming aggregates in blink planner. But I'm not sure if it >> > will be included into this FLIP. >> > >> > cc @Jark Wu >> > >> > Best, >> > Pengcheng >> > >> > >> > 在 2020/10/9 下午5:25,“Benchao Li”<[hidden email]> 写入: >> > >> > Thanks Jark for bringing this discussion, I like this FLIP very >> much. >> > >> > Especially the cumulate window, it's much like the current TUMBLE >> > window + >> > Fast Emit (which is an undocumented experimental feature), however, >> > it's >> > more powerful. >> > >> > And This will make the SQL semantic more standard, especially for >> the >> > HOPPING window. >> > >> > Regarding time attribute, >> > It seems that we don't need a specific function to infer the time >> > attribute >> > like >> > `TUMBLE_ROWTIME` / `TUMBLE_PROCTIME`. Then are `window_start` and >> > `window_end` >> > column a time attribute column automatically? >> > - If not, what will be the time attribute of the result relation of >> > these >> > TVFs? >> > Especially after the window aggregation. >> > - If yes, then how do we handle proctime? >> > >> > Regarding batch operators, >> > It's great to hear that we can reuse the batch operators in >> continuous >> > batch mode >> > as you mentioned in the FLIP. >> > Current window aggregate could also be used in batch mode with >> > rowtime. Do >> > you plan >> > to support these TVFs for batch mode in this FLIP? Hence the >> Table/SQL >> > is a >> > unified >> > API, it's great if we can keep the features complete both in >> streaming >> > and >> > batch mode. >> > >> > There is one more question, I don't know whether it should be >> > considered in >> > this FLIP. >> > Does the new window support `grouping sets`? (It's not supported in >> old >> > window impl). >> > >> > Jark Wu <[hidden email]> 于2020年10月9日周五 下午4:14写道: >> > >> > > Hi all, >> > > >> > > I know we have a lot of discussion and development on going right >> > now but >> > > it would be great if we can get FLIP-145 into a votable state. >> > > If there are no objections, I would like to start voting in the >> next >> > days. >> > > >> > > Best, >> > > Jark >> > > >> > > On Thu, 1 Oct 2020 at 14:29, Jark Wu <[hidden email]> wrote: >> > > >> > > > Hi everyone, >> > > > >> > > > I have added a section for Performance Optimization to describe >> > how to >> > > > improve the performance in the short-term and long-term >> > > > and sketch the future performance potential under the new window >> > API. >> > > > Introducing the window API is just the first step, we will >> > > > continuously improve the performance to make it powerful and >> > useful. >> > > > >> > > > Best, >> > > > Jark >> > > > >> > > > On Thu, 1 Oct 2020 at 14:28, Jark Wu <[hidden email]> wrote: >> > > > >> > > >> Hi Pengcheng, >> > > >> >> > > >> Yes, the window TVF is part of the FLIP. Welcome to contribute >> > and join >> > > >> the discussion. >> > > >> Regarding the SESSION window aggregation, users can use the >> > existing >> > > >> grouped session window function. >> > > >> >> > > >> Best, >> > > >> Jark >> > > >> >> > > >> On Sun, 27 Sep 2020 at 21:24, liupengcheng < >> > [hidden email] >> > > > >> > > >> wrote: >> > > >> >> > > >>> Hi Jark, >> > > >>> Thanks for reply, yes, I think it's a good feature, it >> > can >> > > >>> improve the NRT scenarios >> > > >>> as you mentioned in the FLIP. Also, I think it can >> > improve the >> > > >>> streaming SQL greatly, >> > > >>> it can support richer window operations in flink SQL >> and >> > bring >> > > >>> great convenience to users. >> > > >>> (we are now only supported group window in flink). >> > > >>> >> > > >>> Regarding the SESSION window, I think it's especially >> > useful >> > > for >> > > >>> user behavior analysis(e.g. >> > > >>> counting user visits on a news website or social >> > platform), but >> > > >>> I agree that we can keep it >> > > >>> out of the FLIP now to catch up 1.12. >> > > >>> >> > > >>> Recently, I've done some work on the stream planner >> with >> > the >> > > >>> TVFs, and I'm willing to contribute >> > > >>> to this part. Is it in the plan of this FLIP? >> > > >>> >> > > >>> Best, >> > > >>> PengchengLiu >> > > >>> >> > > >>> >> > > >>> 在 2020/9/26 下午11:09,“Jark Wu”<[hidden email]> 写入: >> > > >>> >> > > >>> Hi pengcheng, >> > > >>> >> > > >>> That's great to see you also have the need of window join. >> > > >>> You are right, the windowing TVF is a powerful feature >> which >> > can >> > > >>> support >> > > >>> more operations in the future. >> > > >>> I think it as of the date time "partition" selection in >> > batch SQL >> > > >>> jobs, >> > > >>> with this new syntax, I think it is possible >> > > >>> to migrate traditional batch SQL jobs to Flink SQL by >> > changing a >> > > >>> few lines. >> > > >>> >> > > >>> Regarding the SESSION window, this is on purpose to keep >> it >> > out of >> > > >>> the >> > > >>> FLIP, because we want to keep the >> > > >>> FLIP small to catch up 1.12 and SESSION TVF is rarely >> useful >> > (e.g. >> > > >>> session >> > > >>> window join?). >> > > >>> >> > > >>> Best, >> > > >>> Jark >> > > >>> >> > > >>> On Fri, 25 Sep 2020 at 22:59, liupengcheng < >> > > >>> [hidden email]> >> > > >>> wrote: >> > > >>> >> > > >>> > Hi, Jark, >> > > >>> > I'm very interested in this feature, and I'm >> also >> > working >> > > >>> on this >> > > >>> > recently. >> > > >>> > I just have a glance at the FLIP, it's good, >> but I >> > found >> > > >>> that >> > > >>> > there is no plan to add SESSION windows. >> > > >>> > Also, I think there can be more things we can do >> > based on >> > > >>> this new >> > > >>> > syntax. For example, >> > > >>> > - window sort support >> > > >>> > - window union/intersect/minus support >> > > >>> > - Improve dimension table join >> > > >>> > We can have more deep discussion on this new >> > feature >> > > later >> > > >>> . >> > > >>> > I've also opened an jira that is related to this >> > feature >> > > >>> recently: >> > > >>> > https://issues.apache.org/jira/browse/FLINK-18830 >> > > >>> > >> > > >>> > Best! >> > > >>> > PengchengLiu >> > > >>> > >> > > >>> > 在 2020/9/25 下午10:30,“Jark Wu”<[hidden email]> 写入: >> > > >>> > >> > > >>> > Hi everyone, >> > > >>> > >> > > >>> > I want to start a FLIP about supporting windowing >> > > table-valued >> > > >>> > functions >> > > >>> > (TVF). >> > > >>> > The main purpose of this FLIP is to improve the near >> > > real-time >> > > >>> (NRT) >> > > >>> > experience of Flink. >> > > >>> > >> > > >>> > FLIP-145: >> > > >>> > >> > > >>> > >> > > >>> >> > > >> > >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-145%3A+Support+SQL+windowing+table-valued+function >> > > >>> > >> > > >>> > We want to introduce TUMBLE, HOP, CUMULATE windowing >> > TVFs, >> > > the >> > > >>> > CUMULATE is >> > > >>> > a new kind of window. >> > > >>> > With the windowing TVFs, we can support richer >> > operations on >> > > >>> windows, >> > > >>> > including window join, window TopN and so on. >> > > >>> > This makes things simple: we only need to assign >> > windows at >> > > the >> > > >>> > beginning >> > > >>> > of the query, and then apply operations after that >> like >> > > >>> traditional >> > > >>> > batch >> > > >>> > SQL. >> > > >>> > We hope it can help to reduce the learning curve of >> > windows, >> > > >>> improve >> > > >>> > NRT >> > > >>> > for Flink, and attract more batch users. >> > > >>> > >> > > >>> > A simple code snippet for 10 minutes tumbling window >> > > aggregate: >> > > >>> > >> > > >>> > SELECT window_start, window_end, SUM(price) >> > > >>> > FROM TABLE( >> > > >>> > TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL >> > '10' >> > > >>> MINUTES)) >> > > >>> > GROUP BY window_start, window_end; >> > > >>> > >> > > >>> > I'm looking forward to your feedback. >> > > >>> > >> > > >>> > Best, >> > > >>> > Jark >> > > >>> > >> > > >>> > >> > > >>> > >> > > >>> >> > > >>> >> > > >>> >> > > >> > >> > >> > -- >> > >> > Best, >> > Benchao Li >> > >> > > > -- > > Best, > Benchao Li > |
Hi pengcheng,
Thanks for your response. I knew that the original time attribute column will be retained after the TVF, what I'm questioning is how do we get the time attribute column after Aggregation. Your answer did not remove my doubts about this. It's ok if we did not plan to integrate new TVF aggregate with old "time attribute scenarios" listed in my previous email in this FLIP. However it's good to elaborate more on this, and leave it to the future plan. pengcheng Liu <[hidden email]> 于2020年10月10日周六 上午10:45写道: > Hi,Benchao, > In TVFs, the time attributes is just passed through from parent rels, > and the TVFs just add two > additional window attributes(i.e. window_start & window_end). Also, I > think the time columns can be not only a time attribute > with type of `TimeIndicatorType` but also a regular column with type > of `Timestamp`. > > For cascaded window operations, we can use window_start/window_end of > the previous window result directly to > indicate operating on the same window, or use new DESCRIPTOR column > to assign new windows, in case of the change of > the time column(e.g. in some case, the original timestamp is > inaccurate and need some conversion to be used). > > You can check the definition or signature of these TVFs in the FLIP. > e.g. > >> SELECT * FROM TABLE( >> TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES)) > > In the example, the `bidtime` is the time attribute column, which is > the first operand of the DESCRIPTOR function. > > +1 start voting. > > Benchao Li <[hidden email]> 于2020年10月10日周六 上午10:08写道: > >> Hi Jark, >> >> 2 & 3 sounds good to me. >> >> Regarding time attribute, >> I still have some questions, I knew it's easy to support cascaded window >> aggregate using new TVFs. >> However there are some other places where need time attribute: >> - CEP >> - interval join >> - order by >> - over window >> If there is no time attribute column, how do we integrate these old >> features with the new TVFs. >> E.g. >> StreamA -> new window aggregate -> interval join -> Sink >> / >> StreamB ----------------------------------- >> >> >> Jark Wu <[hidden email]> 于2020年10月9日周五 下午11:51写道: >> >>> Hi Benchao, >>> >>> 1) time attribute >>> Yes. We don't need time attribute auxiliary function. Because the new >>> window operations are all based on the >>> window_start and window_end columns instead of on the time attributes. >>> So >>> we don't need to propagate time attributes. >>> Cascaded window aggregate can be expressed by simply GROUP BY the >>> window_start and window_end of the previous window result. >>> I have added a cascaded window aggregate example in the Tumbling Window >>> section in the FLIP. >>> If you want to define proctime window aggregate, the time column in TVF >>> should be a proctime attribute field (or PROCTIME() function). >>> >>> 2) batch support >>> Yes. The proposed syntax/API are unified for batch and streaming. Batch >>> support is in the plan, but may not have enough time to catch up 1.12. >>> >>> 3) support `grouping sets` >>> This is not included in the FLIP, but I think it's great if we can >>> support >>> `grouping sets`. >>> The existing window impl doesn't support this because we convert the >>> LogicalAggregate into WindowAggregate in the beginning, >>> the expand grouping sets rule can't be applied in this situation. >>> Fortunately, with the new window impl, the conversion to WindowAggregate >>> will happen at the end, so I think the expand rule can be >>> applied and support this feature naturally. >>> Therefore, IMO, we don't need to include this feature in this FLIP to >>> avoid >>> the FLIP being too large. >>> This can be a follow-up issue (maybe just add tests and docs) after the >>> FLIP. >>> >>> Best, >>> Jark >>> >>> >>> On Fri, 9 Oct 2020 at 19:09, 刘 芃成 <[hidden email]> wrote: >>> >>> > Hi,Benchao, >>> > Welcome to join the discussion, yes, this new syntax can make >>> SQL >>> > more clear and simpler. >>> > For your first question, the `window_start` and `window_end` >>> > columns will be added automatically, >>> > so we don't need to use auxiliary group functions to infer or >>> > access the window properties. >>> > >>> > For the `grouping sets` on TVFs, I think it's interesting if we >>> > can support it, as we already supported `grouping sets` >>> > on streaming aggregates in blink planner. But I'm not sure if >>> it >>> > will be included into this FLIP. >>> > >>> > cc @Jark Wu >>> > >>> > Best, >>> > Pengcheng >>> > >>> > >>> > 在 2020/10/9 下午5:25,“Benchao Li”<[hidden email]> 写入: >>> > >>> > Thanks Jark for bringing this discussion, I like this FLIP very >>> much. >>> > >>> > Especially the cumulate window, it's much like the current TUMBLE >>> > window + >>> > Fast Emit (which is an undocumented experimental feature), however, >>> > it's >>> > more powerful. >>> > >>> > And This will make the SQL semantic more standard, especially for >>> the >>> > HOPPING window. >>> > >>> > Regarding time attribute, >>> > It seems that we don't need a specific function to infer the time >>> > attribute >>> > like >>> > `TUMBLE_ROWTIME` / `TUMBLE_PROCTIME`. Then are `window_start` and >>> > `window_end` >>> > column a time attribute column automatically? >>> > - If not, what will be the time attribute of the result relation of >>> > these >>> > TVFs? >>> > Especially after the window aggregation. >>> > - If yes, then how do we handle proctime? >>> > >>> > Regarding batch operators, >>> > It's great to hear that we can reuse the batch operators in >>> continuous >>> > batch mode >>> > as you mentioned in the FLIP. >>> > Current window aggregate could also be used in batch mode with >>> > rowtime. Do >>> > you plan >>> > to support these TVFs for batch mode in this FLIP? Hence the >>> Table/SQL >>> > is a >>> > unified >>> > API, it's great if we can keep the features complete both in >>> streaming >>> > and >>> > batch mode. >>> > >>> > There is one more question, I don't know whether it should be >>> > considered in >>> > this FLIP. >>> > Does the new window support `grouping sets`? (It's not supported >>> in old >>> > window impl). >>> > >>> > Jark Wu <[hidden email]> 于2020年10月9日周五 下午4:14写道: >>> > >>> > > Hi all, >>> > > >>> > > I know we have a lot of discussion and development on going right >>> > now but >>> > > it would be great if we can get FLIP-145 into a votable state. >>> > > If there are no objections, I would like to start voting in the >>> next >>> > days. >>> > > >>> > > Best, >>> > > Jark >>> > > >>> > > On Thu, 1 Oct 2020 at 14:29, Jark Wu <[hidden email]> wrote: >>> > > >>> > > > Hi everyone, >>> > > > >>> > > > I have added a section for Performance Optimization to describe >>> > how to >>> > > > improve the performance in the short-term and long-term >>> > > > and sketch the future performance potential under the new >>> window >>> > API. >>> > > > Introducing the window API is just the first step, we will >>> > > > continuously improve the performance to make it powerful and >>> > useful. >>> > > > >>> > > > Best, >>> > > > Jark >>> > > > >>> > > > On Thu, 1 Oct 2020 at 14:28, Jark Wu <[hidden email]> wrote: >>> > > > >>> > > >> Hi Pengcheng, >>> > > >> >>> > > >> Yes, the window TVF is part of the FLIP. Welcome to contribute >>> > and join >>> > > >> the discussion. >>> > > >> Regarding the SESSION window aggregation, users can use the >>> > existing >>> > > >> grouped session window function. >>> > > >> >>> > > >> Best, >>> > > >> Jark >>> > > >> >>> > > >> On Sun, 27 Sep 2020 at 21:24, liupengcheng < >>> > [hidden email] >>> > > > >>> > > >> wrote: >>> > > >> >>> > > >>> Hi Jark, >>> > > >>> Thanks for reply, yes, I think it's a good feature, >>> it >>> > can >>> > > >>> improve the NRT scenarios >>> > > >>> as you mentioned in the FLIP. Also, I think it can >>> > improve the >>> > > >>> streaming SQL greatly, >>> > > >>> it can support richer window operations in flink SQL >>> and >>> > bring >>> > > >>> great convenience to users. >>> > > >>> (we are now only supported group window in flink). >>> > > >>> >>> > > >>> Regarding the SESSION window, I think it's especially >>> > useful >>> > > for >>> > > >>> user behavior analysis(e.g. >>> > > >>> counting user visits on a news website or social >>> > platform), but >>> > > >>> I agree that we can keep it >>> > > >>> out of the FLIP now to catch up 1.12. >>> > > >>> >>> > > >>> Recently, I've done some work on the stream planner >>> with >>> > the >>> > > >>> TVFs, and I'm willing to contribute >>> > > >>> to this part. Is it in the plan of this FLIP? >>> > > >>> >>> > > >>> Best, >>> > > >>> PengchengLiu >>> > > >>> >>> > > >>> >>> > > >>> 在 2020/9/26 下午11:09,“Jark Wu”<[hidden email]> 写入: >>> > > >>> >>> > > >>> Hi pengcheng, >>> > > >>> >>> > > >>> That's great to see you also have the need of window >>> join. >>> > > >>> You are right, the windowing TVF is a powerful feature >>> which >>> > can >>> > > >>> support >>> > > >>> more operations in the future. >>> > > >>> I think it as of the date time "partition" selection in >>> > batch SQL >>> > > >>> jobs, >>> > > >>> with this new syntax, I think it is possible >>> > > >>> to migrate traditional batch SQL jobs to Flink SQL by >>> > changing a >>> > > >>> few lines. >>> > > >>> >>> > > >>> Regarding the SESSION window, this is on purpose to keep >>> it >>> > out of >>> > > >>> the >>> > > >>> FLIP, because we want to keep the >>> > > >>> FLIP small to catch up 1.12 and SESSION TVF is rarely >>> useful >>> > (e.g. >>> > > >>> session >>> > > >>> window join?). >>> > > >>> >>> > > >>> Best, >>> > > >>> Jark >>> > > >>> >>> > > >>> On Fri, 25 Sep 2020 at 22:59, liupengcheng < >>> > > >>> [hidden email]> >>> > > >>> wrote: >>> > > >>> >>> > > >>> > Hi, Jark, >>> > > >>> > I'm very interested in this feature, and I'm >>> also >>> > working >>> > > >>> on this >>> > > >>> > recently. >>> > > >>> > I just have a glance at the FLIP, it's good, >>> but I >>> > found >>> > > >>> that >>> > > >>> > there is no plan to add SESSION windows. >>> > > >>> > Also, I think there can be more things we can >>> do >>> > based on >>> > > >>> this new >>> > > >>> > syntax. For example, >>> > > >>> > - window sort support >>> > > >>> > - window union/intersect/minus support >>> > > >>> > - Improve dimension table join >>> > > >>> > We can have more deep discussion on this new >>> > feature >>> > > later >>> > > >>> . >>> > > >>> > I've also opened an jira that is related to >>> this >>> > feature >>> > > >>> recently: >>> > > >>> > https://issues.apache.org/jira/browse/FLINK-18830 >>> > > >>> > >>> > > >>> > Best! >>> > > >>> > PengchengLiu >>> > > >>> > >>> > > >>> > 在 2020/9/25 下午10:30,“Jark Wu”<[hidden email]> 写入: >>> > > >>> > >>> > > >>> > Hi everyone, >>> > > >>> > >>> > > >>> > I want to start a FLIP about supporting windowing >>> > > table-valued >>> > > >>> > functions >>> > > >>> > (TVF). >>> > > >>> > The main purpose of this FLIP is to improve the >>> near >>> > > real-time >>> > > >>> (NRT) >>> > > >>> > experience of Flink. >>> > > >>> > >>> > > >>> > FLIP-145: >>> > > >>> > >>> > > >>> > >>> > > >>> >>> > > >>> > >>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-145%3A+Support+SQL+windowing+table-valued+function >>> > > >>> > >>> > > >>> > We want to introduce TUMBLE, HOP, CUMULATE >>> windowing >>> > TVFs, >>> > > the >>> > > >>> > CUMULATE is >>> > > >>> > a new kind of window. >>> > > >>> > With the windowing TVFs, we can support richer >>> > operations on >>> > > >>> windows, >>> > > >>> > including window join, window TopN and so on. >>> > > >>> > This makes things simple: we only need to assign >>> > windows at >>> > > the >>> > > >>> > beginning >>> > > >>> > of the query, and then apply operations after that >>> like >>> > > >>> traditional >>> > > >>> > batch >>> > > >>> > SQL. >>> > > >>> > We hope it can help to reduce the learning curve of >>> > windows, >>> > > >>> improve >>> > > >>> > NRT >>> > > >>> > for Flink, and attract more batch users. >>> > > >>> > >>> > > >>> > A simple code snippet for 10 minutes tumbling >>> window >>> > > aggregate: >>> > > >>> > >>> > > >>> > SELECT window_start, window_end, SUM(price) >>> > > >>> > FROM TABLE( >>> > > >>> > TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL >>> > '10' >>> > > >>> MINUTES)) >>> > > >>> > GROUP BY window_start, window_end; >>> > > >>> > >>> > > >>> > I'm looking forward to your feedback. >>> > > >>> > >>> > > >>> > Best, >>> > > >>> > Jark >>> > > >>> > >>> > > >>> > >>> > > >>> > >>> > > >>> >>> > > >>> >>> > > >>> >>> > > >>> > >>> > >>> > -- >>> > >>> > Best, >>> > Benchao Li >>> > >>> >> >> >> -- >> >> Best, >> Benchao Li >> > -- Best, Benchao Li |
Hi, Benchao,
I think I got your point, actually, in current implementation for group window aggregation, the value of time attributes(e.g. TUMBLE_ROWTIME/TUMBLE_PROCTIME) is calculated as (window_end – 1), so I think we can just use it directly if you need this. But I think this time attributes is mainly suggested to use in case of cascaded window operations. Regarding the example you provided, I think the semantics of the SQL in your example which doing interval join(e.g. with TUMBLE_ROWTIME) after window aggregation is not clear in the current implementation, and I think that’s a strong reason why we need the new TVFs syntax. With the new syntax, users should understand which time column to use and how to generate it when doing interval join and etc. Best, Pengcheng 发件人: Benchao Li <[hidden email]> 日期: 2020年10月10日 星期六 上午11:02 收件人: pengcheng Liu <[hidden email]> 抄送: dev <[hidden email]> 主题: Re: [DISCUSS] FLIP-145: Support SQL windowing table-valued function Hi pengcheng, Thanks for your response. I knew that the original time attribute column will be retained after the TVF, what I'm questioning is how do we get the time attribute column after Aggregation. Your answer did not remove my doubts about this. It's ok if we did not plan to integrate new TVF aggregate with old "time attribute scenarios" listed in my previous email in this FLIP. However it's good to elaborate more on this, and leave it to the future plan. pengcheng Liu <[hidden email]<mailto:[hidden email]>> 于2020年10月10日周六 上午10:45写道: Hi,Benchao, In TVFs, the time attributes is just passed through from parent rels, and the TVFs just add two additional window attributes(i.e. window_start & window_end). Also, I think the time columns can be not only a time attribute with type of `TimeIndicatorType` but also a regular column with type of `Timestamp`. For cascaded window operations, we can use window_start/window_end of the previous window result directly to indicate operating on the same window, or use new DESCRIPTOR column to assign new windows, in case of the change of the time column(e.g. in some case, the original timestamp is inaccurate and need some conversion to be used). You can check the definition or signature of these TVFs in the FLIP. e.g. SELECT * FROM TABLE( TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES)) In the example, the `bidtime` is the time attribute column, which is the first operand of the DESCRIPTOR function. +1 start voting. Benchao Li <[hidden email]<mailto:[hidden email]>> 于2020年10月10日周六 上午10:08写道: Hi Jark, 2 & 3 sounds good to me. Regarding time attribute, I still have some questions, I knew it's easy to support cascaded window aggregate using new TVFs. However there are some other places where need time attribute: - CEP - interval join - order by - over window If there is no time attribute column, how do we integrate these old features with the new TVFs. E.g. StreamA -> new window aggregate -> interval join -> Sink / StreamB ----------------------------------- Jark Wu <[hidden email]<mailto:[hidden email]>> 于2020年10月9日周五 下午11:51写道: Hi Benchao, 1) time attribute Yes. We don't need time attribute auxiliary function. Because the new window operations are all based on the window_start and window_end columns instead of on the time attributes. So we don't need to propagate time attributes. Cascaded window aggregate can be expressed by simply GROUP BY the window_start and window_end of the previous window result. I have added a cascaded window aggregate example in the Tumbling Window section in the FLIP. If you want to define proctime window aggregate, the time column in TVF should be a proctime attribute field (or PROCTIME() function). 2) batch support Yes. The proposed syntax/API are unified for batch and streaming. Batch support is in the plan, but may not have enough time to catch up 1.12. 3) support `grouping sets` This is not included in the FLIP, but I think it's great if we can support `grouping sets`. The existing window impl doesn't support this because we convert the LogicalAggregate into WindowAggregate in the beginning, the expand grouping sets rule can't be applied in this situation. Fortunately, with the new window impl, the conversion to WindowAggregate will happen at the end, so I think the expand rule can be applied and support this feature naturally. Therefore, IMO, we don't need to include this feature in this FLIP to avoid the FLIP being too large. This can be a follow-up issue (maybe just add tests and docs) after the FLIP. Best, Jark On Fri, 9 Oct 2020 at 19:09, 刘 芃成 <[hidden email]<mailto:[hidden email]>> wrote: > Hi,Benchao, > Welcome to join the discussion, yes, this new syntax can make SQL > more clear and simpler. > For your first question, the `window_start` and `window_end` > columns will be added automatically, > so we don't need to use auxiliary group functions to infer or > access the window properties. > > For the `grouping sets` on TVFs, I think it's interesting if we > can support it, as we already supported `grouping sets` > on streaming aggregates in blink planner. But I'm not sure if it > will be included into this FLIP. > > cc @Jark Wu > > Best, > Pengcheng > > > 在 2020/10/9 下午5:25,“Benchao Li”<[hidden email]<mailto:[hidden email]>> 写入: > > Thanks Jark for bringing this discussion, I like this FLIP very much. > > Especially the cumulate window, it's much like the current TUMBLE > window + > Fast Emit (which is an undocumented experimental feature), however, > it's > more powerful. > > And This will make the SQL semantic more standard, especially for the > HOPPING window. > > Regarding time attribute, > It seems that we don't need a specific function to infer the time > attribute > like > `TUMBLE_ROWTIME` / `TUMBLE_PROCTIME`. Then are `window_start` and > `window_end` > column a time attribute column automatically? > - If not, what will be the time attribute of the result relation of > these > TVFs? > Especially after the window aggregation. > - If yes, then how do we handle proctime? > > Regarding batch operators, > It's great to hear that we can reuse the batch operators in continuous > batch mode > as you mentioned in the FLIP. > Current window aggregate could also be used in batch mode with > rowtime. Do > you plan > to support these TVFs for batch mode in this FLIP? Hence the Table/SQL > is a > unified > API, it's great if we can keep the features complete both in streaming > and > batch mode. > > There is one more question, I don't know whether it should be > considered in > this FLIP. > Does the new window support `grouping sets`? (It's not supported in old > window impl). > > Jark Wu <[hidden email]<mailto:[hidden email]>> 于2020年10月9日周五 下午4:14写道: > > > Hi all, > > > > I know we have a lot of discussion and development on going right > now but > > it would be great if we can get FLIP-145 into a votable state. > > If there are no objections, I would like to start voting in the next > days. > > > > Best, > > Jark > > > > On Thu, 1 Oct 2020 at 14:29, Jark Wu <[hidden email]<mailto:[hidden email]>> wrote: > > > > > Hi everyone, > > > > > > I have added a section for Performance Optimization to describe > how to > > > improve the performance in the short-term and long-term > > > and sketch the future performance potential under the new window > API. > > > Introducing the window API is just the first step, we will > > > continuously improve the performance to make it powerful and > useful. > > > > > > Best, > > > Jark > > > > > > On Thu, 1 Oct 2020 at 14:28, Jark Wu <[hidden email]<mailto:[hidden email]>> wrote: > > > > > >> Hi Pengcheng, > > >> > > >> Yes, the window TVF is part of the FLIP. Welcome to contribute > and join > > >> the discussion. > > >> Regarding the SESSION window aggregation, users can use the > existing > > >> grouped session window function. > > >> > > >> Best, > > >> Jark > > >> > > >> On Sun, 27 Sep 2020 at 21:24, liupengcheng < > [hidden email]<mailto:[hidden email]> > > > > > >> wrote: > > >> > > >>> Hi Jark, > > >>> Thanks for reply, yes, I think it's a good feature, it > can > > >>> improve the NRT scenarios > > >>> as you mentioned in the FLIP. Also, I think it can > improve the > > >>> streaming SQL greatly, > > >>> it can support richer window operations in flink SQL and > bring > > >>> great convenience to users. > > >>> (we are now only supported group window in flink). > > >>> > > >>> Regarding the SESSION window, I think it's especially > useful > > for > > >>> user behavior analysis(e.g. > > >>> counting user visits on a news website or social > platform), but > > >>> I agree that we can keep it > > >>> out of the FLIP now to catch up 1.12. > > >>> > > >>> Recently, I've done some work on the stream planner with > the > > >>> TVFs, and I'm willing to contribute > > >>> to this part. Is it in the plan of this FLIP? > > >>> > > >>> Best, > > >>> PengchengLiu > > >>> > > >>> > > >>> 在 2020/9/26 下午11:09,“Jark Wu”<[hidden email]<mailto:[hidden email]>> 写入: > > >>> > > >>> Hi pengcheng, > > >>> > > >>> That's great to see you also have the need of window join. > > >>> You are right, the windowing TVF is a powerful feature which > can > > >>> support > > >>> more operations in the future. > > >>> I think it as of the date time "partition" selection in > batch SQL > > >>> jobs, > > >>> with this new syntax, I think it is possible > > >>> to migrate traditional batch SQL jobs to Flink SQL by > changing a > > >>> few lines. > > >>> > > >>> Regarding the SESSION window, this is on purpose to keep it > out of > > >>> the > > >>> FLIP, because we want to keep the > > >>> FLIP small to catch up 1.12 and SESSION TVF is rarely useful > (e.g. > > >>> session > > >>> window join?). > > >>> > > >>> Best, > > >>> Jark > > >>> > > >>> On Fri, 25 Sep 2020 at 22:59, liupengcheng < > > >>> [hidden email]<mailto:[hidden email]>> > > >>> wrote: > > >>> > > >>> > Hi, Jark, > > >>> > I'm very interested in this feature, and I'm also > working > > >>> on this > > >>> > recently. > > >>> > I just have a glance at the FLIP, it's good, but I > found > > >>> that > > >>> > there is no plan to add SESSION windows. > > >>> > Also, I think there can be more things we can do > based on > > >>> this new > > >>> > syntax. For example, > > >>> > - window sort support > > >>> > - window union/intersect/minus support > > >>> > - Improve dimension table join > > >>> > We can have more deep discussion on this new > feature > > later > > >>> . > > >>> > I've also opened an jira that is related to this > feature > > >>> recently: > > >>> > https://issues.apache.org/jira/browse/FLINK-18830 > > >>> > > > >>> > Best! > > >>> > PengchengLiu > > >>> > > > >>> > 在 2020/9/25 下午10:30,“Jark Wu”<[hidden email]<mailto:[hidden email]>> 写入: > > >>> > > > >>> > Hi everyone, > > >>> > > > >>> > I want to start a FLIP about supporting windowing > > table-valued > > >>> > functions > > >>> > (TVF). > > >>> > The main purpose of this FLIP is to improve the near > > real-time > > >>> (NRT) > > >>> > experience of Flink. > > >>> > > > >>> > FLIP-145: > > >>> > > > >>> > > > >>> > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-145%3A+Support+SQL+windowing+table-valued+function > > >>> > > > >>> > We want to introduce TUMBLE, HOP, CUMULATE windowing > TVFs, > > the > > >>> > CUMULATE is > > >>> > a new kind of window. > > >>> > With the windowing TVFs, we can support richer > operations on > > >>> windows, > > >>> > including window join, window TopN and so on. > > >>> > This makes things simple: we only need to assign > windows at > > the > > >>> > beginning > > >>> > of the query, and then apply operations after that like > > >>> traditional > > >>> > batch > > >>> > SQL. > > >>> > We hope it can help to reduce the learning curve of > windows, > > >>> improve > > >>> > NRT > > >>> > for Flink, and attract more batch users. > > >>> > > > >>> > A simple code snippet for 10 minutes tumbling window > > aggregate: > > >>> > > > >>> > SELECT window_start, window_end, SUM(price) > > >>> > FROM TABLE( > > >>> > TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL > '10' > > >>> MINUTES)) > > >>> > GROUP BY window_start, window_end; > > >>> > > > >>> > I'm looking forward to your feedback. > > >>> > > > >>> > Best, > > >>> > Jark > > >>> > > > >>> > > > >>> > > > >>> > > >>> > > >>> > > > > > -- > > Best, > Benchao Li > -- Best, Benchao Li -- Best, Benchao Li |
Hi Benchao,
That's a good question. IMO, the new windowed operators and the current time operators are two different sets of functions, just like time operators and non-time operators are two different sets of functions. I think it's fine if we don't support integrating them, just like time operators can't be applied on non-windowed aggregate. If users want to use time operators in the whole pipeline, then he/she can use the grouped window aggregates instead of the window TVFs. The key idea of window TVF is that all the operators in the pipeline are based on the **windows**. In terms of syntax, if the key clause (e.g. group by, partitioned by, join on, order by) contains window_start and window_end, it can be translated into windowed operators. Thus, we will have windowed CEP, windowed sort, windowed over aggregate in the future to make it possible to build a windowed pipeline. But I think we can elaborate the integration more in the future if users need it. Actually, I don't fully understand the scenario of integrating window TVF and time operators at this point. For example, interval join an input stream and a window join result. I don't see why it can't be expressed by nested window join and why users have to use interval join here. Maybe we can wait for more inputs from users when the window TVF is released and we can elaborate it again. Best, Jark On Sat, 10 Oct 2020 at 12:01, 刘 芃成 <[hidden email]> wrote: > Hi, Benchao, > I think I got your point, actually, in current implementation for > group window aggregation, the value of time attributes(e.g. > TUMBLE_ROWTIME/TUMBLE_PROCTIME) is calculated as (window_end – 1), so I > think we can just use it directly if you need this. But I think this time > attributes is mainly suggested to use in case of cascaded window operations. > Regarding the example you provided, I think the semantics of the SQL in > your example which doing interval join(e.g. with TUMBLE_ROWTIME) after > window aggregation is not clear in the current implementation, and I think > that’s a strong reason why we need the new TVFs syntax. > With the new syntax, users should understand which time column to > use and how to generate it when doing interval join and etc. > > Best, > Pengcheng > > 发件人: Benchao Li <[hidden email]> > 日期: 2020年10月10日 星期六 上午11:02 > 收件人: pengcheng Liu <[hidden email]> > 抄送: dev <[hidden email]> > 主题: Re: [DISCUSS] FLIP-145: Support SQL windowing table-valued function > > Hi pengcheng, > > Thanks for your response. > I knew that the original time attribute column will be retained after the > TVF, > what I'm questioning is how do we get the time attribute column after > Aggregation. > Your answer did not remove my doubts about this. > > It's ok if we did not plan to integrate new TVF aggregate with old "time > attribute scenarios" > listed in my previous email in this FLIP. However it's good to elaborate > leave it to the future plan. > > pengcheng Liu <[hidden email]<mailto: > [hidden email]>> 于2020年10月10日周六 上午10:45写道: > Hi,Benchao, > In TVFs, the time attributes is just passed through from parent rels, > and the TVFs just add two > additional window attributes(i.e. window_start & window_end). Also, I > think the time columns can be not only a time attribute > with type of `TimeIndicatorType` but also a regular column with type > of `Timestamp`. > > For cascaded window operations, we can use window_start/window_end of > the previous window result directly to > indicate operating on the same window, or use new DESCRIPTOR column > to assign new windows, in case of the change of > the time column(e.g. in some case, the original timestamp is > inaccurate and need some conversion to be used). > > You can check the definition or signature of these TVFs in the FLIP. > e.g. > SELECT * FROM TABLE( > TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES)) > In the example, the `bidtime` is the time attribute column, which is > the first operand of the DESCRIPTOR function. > > +1 start voting. > > Benchao Li <[hidden email]<mailto:[hidden email]>> > 于2020年10月10日周六 上午10:08写道: > Hi Jark, > > 2 & 3 sounds good to me. > > Regarding time attribute, > I still have some questions, I knew it's easy to support cascaded window > aggregate using new TVFs. > However there are some other places where need time attribute: > - CEP > - interval join > - order by > - over window > If there is no time attribute column, how do we integrate these old > features with the new TVFs. > E.g. > StreamA -> new window aggregate -> interval join -> Sink > / > StreamB ----------------------------------- > > > Jark Wu <[hidden email]<mailto:[hidden email]>> 于2020年10月9日周五 > 下午11:51写道: > Hi Benchao, > > 1) time attribute > Yes. We don't need time attribute auxiliary function. Because the new > window operations are all based on the > window_start and window_end columns instead of on the time attributes. So > we don't need to propagate time attributes. > Cascaded window aggregate can be expressed by simply GROUP BY the > window_start and window_end of the previous window result. > I have added a cascaded window aggregate example in the Tumbling Window > section in the FLIP. > If you want to define proctime window aggregate, the time column in TVF > should be a proctime attribute field (or PROCTIME() function). > > 2) batch support > Yes. The proposed syntax/API are unified for batch and streaming. Batch > support is in the plan, but may not have enough time to catch up 1.12. > > 3) support `grouping sets` > This is not included in the FLIP, but I think it's great if we can support > `grouping sets`. > The existing window impl doesn't support this because we convert the > LogicalAggregate into WindowAggregate in the beginning, > the expand grouping sets rule can't be applied in this situation. > Fortunately, with the new window impl, the conversion to WindowAggregate > will happen at the end, so I think the expand rule can be > applied and support this feature naturally. > Therefore, IMO, we don't need to include this feature in this FLIP to avoid > the FLIP being too large. > This can be a follow-up issue (maybe just add tests and docs) after the > FLIP. > > Best, > Jark > > > On Fri, 9 Oct 2020 at 19:09, 刘 芃成 <[hidden email]<mailto: > [hidden email]>> wrote: > > > Hi,Benchao, > > Welcome to join the discussion, yes, this new syntax can make SQL > > more clear and simpler. > > For your first question, the `window_start` and `window_end` > > columns will be added automatically, > > so we don't need to use auxiliary group functions to infer or > > access the window properties. > > > > For the `grouping sets` on TVFs, I think it's interesting if we > > can support it, as we already supported `grouping sets` > > on streaming aggregates in blink planner. But I'm not sure if it > > will be included into this FLIP. > > > > cc @Jark Wu > > > > Best, > > Pengcheng > > > > > > 在 2020/10/9 下午5:25,“Benchao Li”<[hidden email]<mailto: > [hidden email]>> 写入: > > > > Thanks Jark for bringing this discussion, I like this FLIP very much. > > > > Especially the cumulate window, it's much like the current TUMBLE > > window + > > Fast Emit (which is an undocumented experimental feature), however, > > it's > > more powerful. > > > > And This will make the SQL semantic more standard, especially for the > > HOPPING window. > > > > Regarding time attribute, > > It seems that we don't need a specific function to infer the time > > attribute > > like > > `TUMBLE_ROWTIME` / `TUMBLE_PROCTIME`. Then are `window_start` and > > `window_end` > > column a time attribute column automatically? > > - If not, what will be the time attribute of the result relation of > > these > > TVFs? > > Especially after the window aggregation. > > - If yes, then how do we handle proctime? > > > > Regarding batch operators, > > It's great to hear that we can reuse the batch operators in > continuous > > batch mode > > as you mentioned in the FLIP. > > Current window aggregate could also be used in batch mode with > > rowtime. Do > > you plan > > to support these TVFs for batch mode in this FLIP? Hence the > Table/SQL > > is a > > unified > > API, it's great if we can keep the features complete both in > streaming > > and > > batch mode. > > > > There is one more question, I don't know whether it should be > > considered in > > this FLIP. > > Does the new window support `grouping sets`? (It's not supported in > old > > window impl). > > > > Jark Wu <[hidden email]<mailto:[hidden email]>> 于2020年10月9日周五 > 下午4:14写道: > > > > > Hi all, > > > > > > I know we have a lot of discussion and development on going right > > now but > > > it would be great if we can get FLIP-145 into a votable state. > > > If there are no objections, I would like to start voting in the > next > > days. > > > > > > Best, > > > Jark > > > > > > On Thu, 1 Oct 2020 at 14:29, Jark Wu <[hidden email]<mailto: > [hidden email]>> wrote: > > > > > > > Hi everyone, > > > > > > > > I have added a section for Performance Optimization to describe > > how to > > > > improve the performance in the short-term and long-term > > > > and sketch the future performance potential under the new window > > API. > > > > Introducing the window API is just the first step, we will > > > > continuously improve the performance to make it powerful and > > useful. > > > > > > > > Best, > > > > Jark > > > > > > > > On Thu, 1 Oct 2020 at 14:28, Jark Wu <[hidden email]<mailto: > [hidden email]>> wrote: > > > > > > > >> Hi Pengcheng, > > > >> > > > >> Yes, the window TVF is part of the FLIP. Welcome to contribute > > and join > > > >> the discussion. > > > >> Regarding the SESSION window aggregation, users can use the > > existing > > > >> grouped session window function. > > > >> > > > >> Best, > > > >> Jark > > > >> > > > >> On Sun, 27 Sep 2020 at 21:24, liupengcheng < > > [hidden email]<mailto:[hidden email]> > > > > > > > >> wrote: > > > >> > > > >>> Hi Jark, > > > >>> Thanks for reply, yes, I think it's a good feature, it > > can > > > >>> improve the NRT scenarios > > > >>> as you mentioned in the FLIP. Also, I think it can > > improve the > > > >>> streaming SQL greatly, > > > >>> it can support richer window operations in flink SQL > and > > bring > > > >>> great convenience to users. > > > >>> (we are now only supported group window in flink). > > > >>> > > > >>> Regarding the SESSION window, I think it's especially > > useful > > > for > > > >>> user behavior analysis(e.g. > > > >>> counting user visits on a news website or social > > platform), but > > > >>> I agree that we can keep it > > > >>> out of the FLIP now to catch up 1.12. > > > >>> > > > >>> Recently, I've done some work on the stream planner > with > > the > > > >>> TVFs, and I'm willing to contribute > > > >>> to this part. Is it in the plan of this FLIP? > > > >>> > > > >>> Best, > > > >>> PengchengLiu > > > >>> > > > >>> > > > >>> 在 2020/9/26 下午11:09,“Jark Wu”<[hidden email]<mailto: > [hidden email]>> 写入: > > > >>> > > > >>> Hi pengcheng, > > > >>> > > > >>> That's great to see you also have the need of window join. > > > >>> You are right, the windowing TVF is a powerful feature > which > > can > > > >>> support > > > >>> more operations in the future. > > > >>> I think it as of the date time "partition" selection in > > batch SQL > > > >>> jobs, > > > >>> with this new syntax, I think it is possible > > > >>> to migrate traditional batch SQL jobs to Flink SQL by > > changing a > > > >>> few lines. > > > >>> > > > >>> Regarding the SESSION window, this is on purpose to keep it > > out of > > > >>> the > > > >>> FLIP, because we want to keep the > > > >>> FLIP small to catch up 1.12 and SESSION TVF is rarely > useful > > (e.g. > > > >>> session > > > >>> window join?). > > > >>> > > > >>> Best, > > > >>> Jark > > > >>> > > > >>> On Fri, 25 Sep 2020 at 22:59, liupengcheng < > > > >>> [hidden email]<mailto:[hidden email] > >> > > > >>> wrote: > > > >>> > > > >>> > Hi, Jark, > > > >>> > I'm very interested in this feature, and I'm also > > working > > > >>> on this > > > >>> > recently. > > > >>> > I just have a glance at the FLIP, it's good, but > I > > found > > > >>> that > > > >>> > there is no plan to add SESSION windows. > > > >>> > Also, I think there can be more things we can do > > based on > > > >>> this new > > > >>> > syntax. For example, > > > >>> > - window sort support > > > >>> > - window union/intersect/minus support > > > >>> > - Improve dimension table join > > > >>> > We can have more deep discussion on this new > > feature > > > later > > > >>> . > > > >>> > I've also opened an jira that is related to this > > feature > > > >>> recently: > > > >>> > https://issues.apache.org/jira/browse/FLINK-18830 > > > >>> > > > > >>> > Best! > > > >>> > PengchengLiu > > > >>> > > > > >>> > 在 2020/9/25 下午10:30,“Jark Wu”<[hidden email]<mailto: > [hidden email]>> 写入: > > > >>> > > > > >>> > Hi everyone, > > > >>> > > > > >>> > I want to start a FLIP about supporting windowing > > > table-valued > > > >>> > functions > > > >>> > (TVF). > > > >>> > The main purpose of this FLIP is to improve the near > > > real-time > > > >>> (NRT) > > > >>> > experience of Flink. > > > >>> > > > > >>> > FLIP-145: > > > >>> > > > > >>> > > > > >>> > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-145%3A+Support+SQL+windowing+table-valued+function > > > >>> > > > > >>> > We want to introduce TUMBLE, HOP, CUMULATE windowing > > TVFs, > > > the > > > >>> > CUMULATE is > > > >>> > a new kind of window. > > > >>> > With the windowing TVFs, we can support richer > > operations on > > > >>> windows, > > > >>> > including window join, window TopN and so on. > > > >>> > This makes things simple: we only need to assign > > windows at > > > the > > > >>> > beginning > > > >>> > of the query, and then apply operations after that > like > > > >>> traditional > > > >>> > batch > > > >>> > SQL. > > > >>> > We hope it can help to reduce the learning curve of > > windows, > > > >>> improve > > > >>> > NRT > > > >>> > for Flink, and attract more batch users. > > > >>> > > > > >>> > A simple code snippet for 10 minutes tumbling window > > > aggregate: > > > >>> > > > > >>> > SELECT window_start, window_end, SUM(price) > > > >>> > FROM TABLE( > > > >>> > TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL > > '10' > > > >>> MINUTES)) > > > >>> > GROUP BY window_start, window_end; > > > >>> > > > > >>> > I'm looking forward to your feedback. > > > >>> > > > > >>> > Best, > > > >>> > Jark > > > >>> > > > > >>> > > > > >>> > > > > >>> > > > >>> > > > >>> > > > > > > > > > -- > > > > Best, > > Benchao Li > > > > > -- > > Best, > Benchao Li > > > -- > > Best, > Benchao Li > |
Hi Jark,
Thanks for your reply, this makes sense to me. The scenario I used above is just a case to explain what I'm concerned about, not necessarily a production use case. We can leave it to the future to see whether other users have these use cases. Then I have no other concerns, +1 to start the VOTE. Jark Wu <[hidden email]> 于2020年10月10日周六 下午1:44写道: > Hi Benchao, > > That's a good question. > > IMO, the new windowed operators and the current time operators are two > different sets of functions, > just like time operators and non-time operators are two different sets of > functions. > I think it's fine if we don't support integrating them, just like time > operators can't be applied on non-windowed aggregate. > If users want to use time operators in the whole pipeline, then he/she can > use the grouped window aggregates instead of the window TVFs. > > The key idea of window TVF is that all the operators in the pipeline are > based on the **windows**. > In terms of syntax, if the key clause (e.g. group by, partitioned by, join > on, order by) contains window_start and window_end, > it can be translated into windowed operators. > Thus, we will have windowed CEP, windowed sort, windowed over aggregate in > the future to make it possible to build a windowed pipeline. > > But I think we can elaborate the integration more in the future if users > need it. Actually, I don't fully understand the scenario of integrating > window TVF and time operators at this point. > For example, interval join an input stream and a window join result. I > don't see why it can't be expressed by nested window join and why users > have to use interval join here. > Maybe we can wait for more inputs from users when the window TVF is > released and we can elaborate it again. > > Best, > Jark > > On Sat, 10 Oct 2020 at 12:01, 刘 芃成 <[hidden email]> wrote: > >> Hi, Benchao, >> I think I got your point, actually, in current implementation for >> group window aggregation, the value of time attributes(e.g. >> TUMBLE_ROWTIME/TUMBLE_PROCTIME) is calculated as (window_end – 1), so I >> think we can just use it directly if you need this. But I think this time >> attributes is mainly suggested to use in case of cascaded window operations. >> Regarding the example you provided, I think the semantics of the SQL in >> your example which doing interval join(e.g. with TUMBLE_ROWTIME) after >> window aggregation is not clear in the current implementation, and I think >> that’s a strong reason why we need the new TVFs syntax. >> With the new syntax, users should understand which time column to >> use and how to generate it when doing interval join and etc. >> >> Best, >> Pengcheng >> >> 发件人: Benchao Li <[hidden email]> >> 日期: 2020年10月10日 星期六 上午11:02 >> 收件人: pengcheng Liu <[hidden email]> >> 抄送: dev <[hidden email]> >> 主题: Re: [DISCUSS] FLIP-145: Support SQL windowing table-valued function >> >> Hi pengcheng, >> >> Thanks for your response. >> I knew that the original time attribute column will be retained after the >> TVF, >> what I'm questioning is how do we get the time attribute column after >> Aggregation. >> Your answer did not remove my doubts about this. >> >> It's ok if we did not plan to integrate new TVF aggregate with old "time >> attribute scenarios" >> listed in my previous email in this FLIP. However it's good to elaborate >> leave it to the future plan. >> >> pengcheng Liu <[hidden email]<mailto: >> [hidden email]>> 于2020年10月10日周六 上午10:45写道: >> Hi,Benchao, >> In TVFs, the time attributes is just passed through from parent rels, >> and the TVFs just add two >> additional window attributes(i.e. window_start & window_end). Also, I >> think the time columns can be not only a time attribute >> with type of `TimeIndicatorType` but also a regular column with type >> of `Timestamp`. >> >> For cascaded window operations, we can use window_start/window_end of >> the previous window result directly to >> indicate operating on the same window, or use new DESCRIPTOR column >> to assign new windows, in case of the change of >> the time column(e.g. in some case, the original timestamp is >> inaccurate and need some conversion to be used). >> >> You can check the definition or signature of these TVFs in the FLIP. >> e.g. >> SELECT * FROM TABLE( >> TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES)) >> In the example, the `bidtime` is the time attribute column, which is >> the first operand of the DESCRIPTOR function. >> >> +1 start voting. >> >> Benchao Li <[hidden email]<mailto:[hidden email]>> >> 于2020年10月10日周六 上午10:08写道: >> Hi Jark, >> >> 2 & 3 sounds good to me. >> >> Regarding time attribute, >> I still have some questions, I knew it's easy to support cascaded window >> aggregate using new TVFs. >> However there are some other places where need time attribute: >> - CEP >> - interval join >> - order by >> - over window >> If there is no time attribute column, how do we integrate these old >> features with the new TVFs. >> E.g. >> StreamA -> new window aggregate -> interval join -> Sink >> / >> StreamB ----------------------------------- >> >> >> Jark Wu <[hidden email]<mailto:[hidden email]>> 于2020年10月9日周五 >> 下午11:51写道: >> Hi Benchao, >> >> 1) time attribute >> Yes. We don't need time attribute auxiliary function. Because the new >> window operations are all based on the >> window_start and window_end columns instead of on the time attributes. So >> we don't need to propagate time attributes. >> Cascaded window aggregate can be expressed by simply GROUP BY the >> window_start and window_end of the previous window result. >> I have added a cascaded window aggregate example in the Tumbling Window >> section in the FLIP. >> If you want to define proctime window aggregate, the time column in TVF >> should be a proctime attribute field (or PROCTIME() function). >> >> 2) batch support >> Yes. The proposed syntax/API are unified for batch and streaming. Batch >> support is in the plan, but may not have enough time to catch up 1.12. >> >> 3) support `grouping sets` >> This is not included in the FLIP, but I think it's great if we can support >> `grouping sets`. >> The existing window impl doesn't support this because we convert the >> LogicalAggregate into WindowAggregate in the beginning, >> the expand grouping sets rule can't be applied in this situation. >> Fortunately, with the new window impl, the conversion to WindowAggregate >> will happen at the end, so I think the expand rule can be >> applied and support this feature naturally. >> Therefore, IMO, we don't need to include this feature in this FLIP to >> avoid >> the FLIP being too large. >> This can be a follow-up issue (maybe just add tests and docs) after the >> FLIP. >> >> Best, >> Jark >> >> >> On Fri, 9 Oct 2020 at 19:09, 刘 芃成 <[hidden email]<mailto: >> [hidden email]>> wrote: >> >> > Hi,Benchao, >> > Welcome to join the discussion, yes, this new syntax can make >> SQL >> > more clear and simpler. >> > For your first question, the `window_start` and `window_end` >> > columns will be added automatically, >> > so we don't need to use auxiliary group functions to infer or >> > access the window properties. >> > >> > For the `grouping sets` on TVFs, I think it's interesting if we >> > can support it, as we already supported `grouping sets` >> > on streaming aggregates in blink planner. But I'm not sure if it >> > will be included into this FLIP. >> > >> > cc @Jark Wu >> > >> > Best, >> > Pengcheng >> > >> > >> > 在 2020/10/9 下午5:25,“Benchao Li”<[hidden email]<mailto: >> [hidden email]>> 写入: >> > >> > Thanks Jark for bringing this discussion, I like this FLIP very >> much. >> > >> > Especially the cumulate window, it's much like the current TUMBLE >> > window + >> > Fast Emit (which is an undocumented experimental feature), however, >> > it's >> > more powerful. >> > >> > And This will make the SQL semantic more standard, especially for >> the >> > HOPPING window. >> > >> > Regarding time attribute, >> > It seems that we don't need a specific function to infer the time >> > attribute >> > like >> > `TUMBLE_ROWTIME` / `TUMBLE_PROCTIME`. Then are `window_start` and >> > `window_end` >> > column a time attribute column automatically? >> > - If not, what will be the time attribute of the result relation of >> > these >> > TVFs? >> > Especially after the window aggregation. >> > - If yes, then how do we handle proctime? >> > >> > Regarding batch operators, >> > It's great to hear that we can reuse the batch operators in >> continuous >> > batch mode >> > as you mentioned in the FLIP. >> > Current window aggregate could also be used in batch mode with >> > rowtime. Do >> > you plan >> > to support these TVFs for batch mode in this FLIP? Hence the >> Table/SQL >> > is a >> > unified >> > API, it's great if we can keep the features complete both in >> streaming >> > and >> > batch mode. >> > >> > There is one more question, I don't know whether it should be >> > considered in >> > this FLIP. >> > Does the new window support `grouping sets`? (It's not supported in >> old >> > window impl). >> > >> > Jark Wu <[hidden email]<mailto:[hidden email]>> 于2020年10月9日周五 >> 下午4:14写道: >> > >> > > Hi all, >> > > >> > > I know we have a lot of discussion and development on going right >> > now but >> > > it would be great if we can get FLIP-145 into a votable state. >> > > If there are no objections, I would like to start voting in the >> next >> > days. >> > > >> > > Best, >> > > Jark >> > > >> > > On Thu, 1 Oct 2020 at 14:29, Jark Wu <[hidden email]<mailto: >> [hidden email]>> wrote: >> > > >> > > > Hi everyone, >> > > > >> > > > I have added a section for Performance Optimization to describe >> > how to >> > > > improve the performance in the short-term and long-term >> > > > and sketch the future performance potential under the new window >> > API. >> > > > Introducing the window API is just the first step, we will >> > > > continuously improve the performance to make it powerful and >> > useful. >> > > > >> > > > Best, >> > > > Jark >> > > > >> > > > On Thu, 1 Oct 2020 at 14:28, Jark Wu <[hidden email]<mailto: >> [hidden email]>> wrote: >> > > > >> > > >> Hi Pengcheng, >> > > >> >> > > >> Yes, the window TVF is part of the FLIP. Welcome to contribute >> > and join >> > > >> the discussion. >> > > >> Regarding the SESSION window aggregation, users can use the >> > existing >> > > >> grouped session window function. >> > > >> >> > > >> Best, >> > > >> Jark >> > > >> >> > > >> On Sun, 27 Sep 2020 at 21:24, liupengcheng < >> > [hidden email]<mailto:[hidden email]> >> > > > >> > > >> wrote: >> > > >> >> > > >>> Hi Jark, >> > > >>> Thanks for reply, yes, I think it's a good feature, it >> > can >> > > >>> improve the NRT scenarios >> > > >>> as you mentioned in the FLIP. Also, I think it can >> > improve the >> > > >>> streaming SQL greatly, >> > > >>> it can support richer window operations in flink SQL >> and >> > bring >> > > >>> great convenience to users. >> > > >>> (we are now only supported group window in flink). >> > > >>> >> > > >>> Regarding the SESSION window, I think it's especially >> > useful >> > > for >> > > >>> user behavior analysis(e.g. >> > > >>> counting user visits on a news website or social >> > platform), but >> > > >>> I agree that we can keep it >> > > >>> out of the FLIP now to catch up 1.12. >> > > >>> >> > > >>> Recently, I've done some work on the stream planner >> with >> > the >> > > >>> TVFs, and I'm willing to contribute >> > > >>> to this part. Is it in the plan of this FLIP? >> > > >>> >> > > >>> Best, >> > > >>> PengchengLiu >> > > >>> >> > > >>> >> > > >>> 在 2020/9/26 下午11:09,“Jark Wu”<[hidden email]<mailto: >> [hidden email]>> 写入: >> > > >>> >> > > >>> Hi pengcheng, >> > > >>> >> > > >>> That's great to see you also have the need of window join. >> > > >>> You are right, the windowing TVF is a powerful feature >> which >> > can >> > > >>> support >> > > >>> more operations in the future. >> > > >>> I think it as of the date time "partition" selection in >> > batch SQL >> > > >>> jobs, >> > > >>> with this new syntax, I think it is possible >> > > >>> to migrate traditional batch SQL jobs to Flink SQL by >> > changing a >> > > >>> few lines. >> > > >>> >> > > >>> Regarding the SESSION window, this is on purpose to keep >> it >> > out of >> > > >>> the >> > > >>> FLIP, because we want to keep the >> > > >>> FLIP small to catch up 1.12 and SESSION TVF is rarely >> useful >> > (e.g. >> > > >>> session >> > > >>> window join?). >> > > >>> >> > > >>> Best, >> > > >>> Jark >> > > >>> >> > > >>> On Fri, 25 Sep 2020 at 22:59, liupengcheng < >> > > >>> [hidden email]<mailto: >> [hidden email]>> >> > > >>> wrote: >> > > >>> >> > > >>> > Hi, Jark, >> > > >>> > I'm very interested in this feature, and I'm >> also >> > working >> > > >>> on this >> > > >>> > recently. >> > > >>> > I just have a glance at the FLIP, it's good, >> but I >> > found >> > > >>> that >> > > >>> > there is no plan to add SESSION windows. >> > > >>> > Also, I think there can be more things we can do >> > based on >> > > >>> this new >> > > >>> > syntax. For example, >> > > >>> > - window sort support >> > > >>> > - window union/intersect/minus support >> > > >>> > - Improve dimension table join >> > > >>> > We can have more deep discussion on this new >> > feature >> > > later >> > > >>> . >> > > >>> > I've also opened an jira that is related to this >> > feature >> > > >>> recently: >> > > >>> > https://issues.apache.org/jira/browse/FLINK-18830 >> > > >>> > >> > > >>> > Best! >> > > >>> > PengchengLiu >> > > >>> > >> > > >>> > 在 2020/9/25 下午10:30,“Jark Wu”<[hidden email]<mailto: >> [hidden email]>> 写入: >> > > >>> > >> > > >>> > Hi everyone, >> > > >>> > >> > > >>> > I want to start a FLIP about supporting windowing >> > > table-valued >> > > >>> > functions >> > > >>> > (TVF). >> > > >>> > The main purpose of this FLIP is to improve the near >> > > real-time >> > > >>> (NRT) >> > > >>> > experience of Flink. >> > > >>> > >> > > >>> > FLIP-145: >> > > >>> > >> > > >>> > >> > > >>> >> > > >> > >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-145%3A+Support+SQL+windowing+table-valued+function >> > > >>> > >> > > >>> > We want to introduce TUMBLE, HOP, CUMULATE windowing >> > TVFs, >> > > the >> > > >>> > CUMULATE is >> > > >>> > a new kind of window. >> > > >>> > With the windowing TVFs, we can support richer >> > operations on >> > > >>> windows, >> > > >>> > including window join, window TopN and so on. >> > > >>> > This makes things simple: we only need to assign >> > windows at >> > > the >> > > >>> > beginning >> > > >>> > of the query, and then apply operations after that >> like >> > > >>> traditional >> > > >>> > batch >> > > >>> > SQL. >> > > >>> > We hope it can help to reduce the learning curve of >> > windows, >> > > >>> improve >> > > >>> > NRT >> > > >>> > for Flink, and attract more batch users. >> > > >>> > >> > > >>> > A simple code snippet for 10 minutes tumbling window >> > > aggregate: >> > > >>> > >> > > >>> > SELECT window_start, window_end, SUM(price) >> > > >>> > FROM TABLE( >> > > >>> > TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL >> > '10' >> > > >>> MINUTES)) >> > > >>> > GROUP BY window_start, window_end; >> > > >>> > >> > > >>> > I'm looking forward to your feedback. >> > > >>> > >> > > >>> > Best, >> > > >>> > Jark >> > > >>> > >> > > >>> > >> > > >>> > >> > > >>> >> > > >>> >> > > >>> >> > > >> > >> > >> > -- >> > >> > Best, >> > Benchao Li >> > >> >> >> -- >> >> Best, >> Benchao Li >> >> >> -- >> >> Best, >> Benchao Li >> > -- Best, Benchao Li |
Thanks for driving this Jark ~
The FLIP overall looks good, i think the window TVFs would be the main "entry point" syntax for our NTR use cases. The syntax originated from the "One SQL To Rule Them All" paper and i think we have reached an agreement. I want to make some additions to the window TVF syntax here - We support the standard syntax of polymorphic table functions with named parameters. e.g. select * from table( tumble( DATA => table Shipments, TIMECOL => descriptor(rowtime), SIZE => INTERVAL '1' MINUTE)) - The first parameter can also be any form of sub-query, e.g. select * from table( tumble((select * from Shipments), descriptor(rowtime), INTERVAL '1' MINUTE)) The current proposed syntax is a simplified one and for most of the cases it is more easy to use. The question asked by Benchao is reasonable, as we suggest the window TVF as a normal UDTF, it can be chained with any kind of other non-windowed operators, we may need some time to give clear semantics for them (especially it is good if there are real use cases), as a start, let us focus the windowed operations first. I'm also +1 for voting ~ Best, Danny Chan 在 2020年10月10日 +0800 PM2:03,Benchao Li <[hidden email]>,写道: > Hi Jark, > > Thanks for your reply, this makes sense to me. > > The scenario I used above is just a case to explain what I'm concerned > about, > not necessarily a production use case. We can leave it to the future to see > whether > other users have these use cases. > > Then I have no other concerns, +1 to start the VOTE. > > > Jark Wu <[hidden email]> 于2020年10月10日周六 下午1:44写道: > > > Hi Benchao, > > > > That's a good question. > > > > IMO, the new windowed operators and the current time operators are two > > different sets of functions, > > just like time operators and non-time operators are two different sets of > > functions. > > I think it's fine if we don't support integrating them, just like time > > operators can't be applied on non-windowed aggregate. > > If users want to use time operators in the whole pipeline, then he/she can > > use the grouped window aggregates instead of the window TVFs. > > > > The key idea of window TVF is that all the operators in the pipeline are > > based on the **windows**. > > In terms of syntax, if the key clause (e.g. group by, partitioned by, join > > on, order by) contains window_start and window_end, > > it can be translated into windowed operators. > > Thus, we will have windowed CEP, windowed sort, windowed over aggregate in > > the future to make it possible to build a windowed pipeline. > > > > But I think we can elaborate the integration more in the future if users > > need it. Actually, I don't fully understand the scenario of integrating > > window TVF and time operators at this point. > > For example, interval join an input stream and a window join result. I > > don't see why it can't be expressed by nested window join and why users > > have to use interval join here. > > Maybe we can wait for more inputs from users when the window TVF is > > released and we can elaborate it again. > > > > Best, > > Jark > > > > On Sat, 10 Oct 2020 at 12:01, 刘 芃成 <[hidden email]> wrote: > > > > > Hi, Benchao, > > > I think I got your point, actually, in current implementation for > > > group window aggregation, the value of time attributes(e.g. > > > TUMBLE_ROWTIME/TUMBLE_PROCTIME) is calculated as (window_end – 1), so I > > > think we can just use it directly if you need this. But I think this time > > > attributes is mainly suggested to use in case of cascaded window operations. > > > Regarding the example you provided, I think the semantics of the SQL in > > > your example which doing interval join(e.g. with TUMBLE_ROWTIME) after > > > window aggregation is not clear in the current implementation, and I think > > > that’s a strong reason why we need the new TVFs syntax. > > > With the new syntax, users should understand which time column to > > > use and how to generate it when doing interval join and etc. > > > > > > Best, > > > Pengcheng > > > > > > 发件人: Benchao Li <[hidden email]> > > > 日期: 2020年10月10日 星期六 上午11:02 > > > 收件人: pengcheng Liu <[hidden email]> > > > 抄送: dev <[hidden email]> > > > 主题: Re: [DISCUSS] FLIP-145: Support SQL windowing table-valued function > > > > > > Hi pengcheng, > > > > > > Thanks for your response. > > > I knew that the original time attribute column will be retained after the > > > TVF, > > > what I'm questioning is how do we get the time attribute column after > > > Aggregation. > > > Your answer did not remove my doubts about this. > > > > > > It's ok if we did not plan to integrate new TVF aggregate with old "time > > > attribute scenarios" > > > listed in my previous email in this FLIP. However it's good to elaborate > > > leave it to the future plan. > > > > > > pengcheng Liu <[hidden email]<mailto: > > > [hidden email]>> 于2020年10月10日周六 上午10:45写道: > > > Hi,Benchao, > > > In TVFs, the time attributes is just passed through from parent rels, > > > and the TVFs just add two > > > additional window attributes(i.e. window_start & window_end). Also, I > > > think the time columns can be not only a time attribute > > > with type of `TimeIndicatorType` but also a regular column with type > > > of `Timestamp`. > > > > > > For cascaded window operations, we can use window_start/window_end of > > > the previous window result directly to > > > indicate operating on the same window, or use new DESCRIPTOR column > > > to assign new windows, in case of the change of > > > the time column(e.g. in some case, the original timestamp is > > > inaccurate and need some conversion to be used). > > > > > > You can check the definition or signature of these TVFs in the FLIP. > > > e.g. > > > SELECT * FROM TABLE( > > > TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES)) > > > In the example, the `bidtime` is the time attribute column, which is > > > the first operand of the DESCRIPTOR function. > > > > > > +1 start voting. > > > > > > Benchao Li <[hidden email]<mailto:[hidden email]>> > > > 于2020年10月10日周六 上午10:08写道: > > > Hi Jark, > > > > > > 2 & 3 sounds good to me. > > > > > > Regarding time attribute, > > > I still have some questions, I knew it's easy to support cascaded window > > > aggregate using new TVFs. > > > However there are some other places where need time attribute: > > > - CEP > > > - interval join > > > - order by > > > - over window > > > If there is no time attribute column, how do we integrate these old > > > features with the new TVFs. > > > E.g. > > > StreamA -> new window aggregate -> interval join -> Sink > > > / > > > StreamB ----------------------------------- > > > > > > > > > Jark Wu <[hidden email]<mailto:[hidden email]>> 于2020年10月9日周五 > > > 下午11:51写道: > > > Hi Benchao, > > > > > > 1) time attribute > > > Yes. We don't need time attribute auxiliary function. Because the new > > > window operations are all based on the > > > window_start and window_end columns instead of on the time attributes. So > > > we don't need to propagate time attributes. > > > Cascaded window aggregate can be expressed by simply GROUP BY the > > > window_start and window_end of the previous window result. > > > I have added a cascaded window aggregate example in the Tumbling Window > > > section in the FLIP. > > > If you want to define proctime window aggregate, the time column in TVF > > > should be a proctime attribute field (or PROCTIME() function). > > > > > > 2) batch support > > > Yes. The proposed syntax/API are unified for batch and streaming. Batch > > > support is in the plan, but may not have enough time to catch up 1.12. > > > > > > 3) support `grouping sets` > > > This is not included in the FLIP, but I think it's great if we can support > > > `grouping sets`. > > > The existing window impl doesn't support this because we convert the > > > LogicalAggregate into WindowAggregate in the beginning, > > > the expand grouping sets rule can't be applied in this situation. > > > Fortunately, with the new window impl, the conversion to WindowAggregate > > > will happen at the end, so I think the expand rule can be > > > applied and support this feature naturally. > > > Therefore, IMO, we don't need to include this feature in this FLIP to > > > avoid > > > the FLIP being too large. > > > This can be a follow-up issue (maybe just add tests and docs) after the > > > FLIP. > > > > > > Best, > > > Jark > > > > > > > > > On Fri, 9 Oct 2020 at 19:09, 刘 芃成 <[hidden email]<mailto: > > > [hidden email]>> wrote: > > > > > > > Hi,Benchao, > > > > Welcome to join the discussion, yes, this new syntax can make > > > SQL > > > > more clear and simpler. > > > > For your first question, the `window_start` and `window_end` > > > > columns will be added automatically, > > > > so we don't need to use auxiliary group functions to infer or > > > > access the window properties. > > > > > > > > For the `grouping sets` on TVFs, I think it's interesting if we > > > > can support it, as we already supported `grouping sets` > > > > on streaming aggregates in blink planner. But I'm not sure if it > > > > will be included into this FLIP. > > > > > > > > cc @Jark Wu > > > > > > > > Best, > > > > Pengcheng > > > > > > > > > > > > 在 2020/10/9 下午5:25,“Benchao Li”<[hidden email]<mailto: > > > [hidden email]>> 写入: > > > > > > > > Thanks Jark for bringing this discussion, I like this FLIP very > > > much. > > > > > > > > Especially the cumulate window, it's much like the current TUMBLE > > > > window + > > > > Fast Emit (which is an undocumented experimental feature), however, > > > > it's > > > > more powerful. > > > > > > > > And This will make the SQL semantic more standard, especially for > > > the > > > > HOPPING window. > > > > > > > > Regarding time attribute, > > > > It seems that we don't need a specific function to infer the time > > > > attribute > > > > like > > > > `TUMBLE_ROWTIME` / `TUMBLE_PROCTIME`. Then are `window_start` and > > > > `window_end` > > > > column a time attribute column automatically? > > > > - If not, what will be the time attribute of the result relation of > > > > these > > > > TVFs? > > > > Especially after the window aggregation. > > > > - If yes, then how do we handle proctime? > > > > > > > > Regarding batch operators, > > > > It's great to hear that we can reuse the batch operators in > > > continuous > > > > batch mode > > > > as you mentioned in the FLIP. > > > > Current window aggregate could also be used in batch mode with > > > > rowtime. Do > > > > you plan > > > > to support these TVFs for batch mode in this FLIP? Hence the > > > Table/SQL > > > > is a > > > > unified > > > > API, it's great if we can keep the features complete both in > > > streaming > > > > and > > > > batch mode. > > > > > > > > There is one more question, I don't know whether it should be > > > > considered in > > > > this FLIP. > > > > Does the new window support `grouping sets`? (It's not supported in > > > old > > > > window impl). > > > > > > > > Jark Wu <[hidden email]<mailto:[hidden email]>> 于2020年10月9日周五 > > > 下午4:14写道: > > > > > > > > > Hi all, > > > > > > > > > > I know we have a lot of discussion and development on going right > > > > now but > > > > > it would be great if we can get FLIP-145 into a votable state. > > > > > If there are no objections, I would like to start voting in the > > > next > > > > days. > > > > > > > > > > Best, > > > > > Jark > > > > > > > > > > On Thu, 1 Oct 2020 at 14:29, Jark Wu <[hidden email]<mailto: > > > [hidden email]>> wrote: > > > > > > > > > > > Hi everyone, > > > > > > > > > > > > I have added a section for Performance Optimization to describe > > > > how to > > > > > > improve the performance in the short-term and long-term > > > > > > and sketch the future performance potential under the new window > > > > API. > > > > > > Introducing the window API is just the first step, we will > > > > > > continuously improve the performance to make it powerful and > > > > useful. > > > > > > > > > > > > Best, > > > > > > Jark > > > > > > > > > > > > On Thu, 1 Oct 2020 at 14:28, Jark Wu <[hidden email]<mailto: > > > [hidden email]>> wrote: > > > > > > > > > > > > > Hi Pengcheng, > > > > > > > > > > > > > > Yes, the window TVF is part of the FLIP. Welcome to contribute > > > > and join > > > > > > > the discussion. > > > > > > > Regarding the SESSION window aggregation, users can use the > > > > existing > > > > > > > grouped session window function. > > > > > > > > > > > > > > Best, > > > > > > > Jark > > > > > > > > > > > > > > On Sun, 27 Sep 2020 at 21:24, liupengcheng < > > > > [hidden email]<mailto:[hidden email]> > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > Hi Jark, > > > > > > > > Thanks for reply, yes, I think it's a good feature, it > > > > can > > > > > > > > improve the NRT scenarios > > > > > > > > as you mentioned in the FLIP. Also, I think it can > > > > improve the > > > > > > > > streaming SQL greatly, > > > > > > > > it can support richer window operations in flink SQL > > > and > > > > bring > > > > > > > > great convenience to users. > > > > > > > > (we are now only supported group window in flink). > > > > > > > > > > > > > > > > Regarding the SESSION window, I think it's especially > > > > useful > > > > > for > > > > > > > > user behavior analysis(e.g. > > > > > > > > counting user visits on a news website or social > > > > platform), but > > > > > > > > I agree that we can keep it > > > > > > > > out of the FLIP now to catch up 1.12. > > > > > > > > > > > > > > > > Recently, I've done some work on the stream planner > > > with > > > > the > > > > > > > > TVFs, and I'm willing to contribute > > > > > > > > to this part. Is it in the plan of this FLIP? > > > > > > > > > > > > > > > > Best, > > > > > > > > PengchengLiu > > > > > > > > > > > > > > > > > > > > > > > > 在 2020/9/26 下午11:09,“Jark Wu”<[hidden email]<mailto: > > > [hidden email]>> 写入: > > > > > > > > > > > > > > > > Hi pengcheng, > > > > > > > > > > > > > > > > That's great to see you also have the need of window join. > > > > > > > > You are right, the windowing TVF is a powerful feature > > > which > > > > can > > > > > > > > support > > > > > > > > more operations in the future. > > > > > > > > I think it as of the date time "partition" selection in > > > > batch SQL > > > > > > > > jobs, > > > > > > > > with this new syntax, I think it is possible > > > > > > > > to migrate traditional batch SQL jobs to Flink SQL by > > > > changing a > > > > > > > > few lines. > > > > > > > > > > > > > > > > Regarding the SESSION window, this is on purpose to keep > > > it > > > > out of > > > > > > > > the > > > > > > > > FLIP, because we want to keep the > > > > > > > > FLIP small to catch up 1.12 and SESSION TVF is rarely > > > useful > > > > (e.g. > > > > > > > > session > > > > > > > > window join?). > > > > > > > > > > > > > > > > Best, > > > > > > > > Jark > > > > > > > > > > > > > > > > On Fri, 25 Sep 2020 at 22:59, liupengcheng < > > > > > > > > [hidden email]<mailto: > > > [hidden email]>> > > > > > > > > wrote: > > > > > > > > > > > > > > > > > Hi, Jark, > > > > > > > > > I'm very interested in this feature, and I'm > > > also > > > > working > > > > > > > > on this > > > > > > > > > recently. > > > > > > > > > I just have a glance at the FLIP, it's good, > > > but I > > > > found > > > > > > > > that > > > > > > > > > there is no plan to add SESSION windows. > > > > > > > > > Also, I think there can be more things we can do > > > > based on > > > > > > > > this new > > > > > > > > > syntax. For example, > > > > > > > > > - window sort support > > > > > > > > > - window union/intersect/minus support > > > > > > > > > - Improve dimension table join > > > > > > > > > We can have more deep discussion on this new > > > > feature > > > > > later > > > > > > > > . > > > > > > > > > I've also opened an jira that is related to this > > > > feature > > > > > > > > recently: > > > > > > > > > https://issues.apache.org/jira/browse/FLINK-18830 > > > > > > > > > > > > > > > > > > Best! > > > > > > > > > PengchengLiu > > > > > > > > > > > > > > > > > > 在 2020/9/25 下午10:30,“Jark Wu”<[hidden email]<mailto: > > > [hidden email]>> 写入: > > > > > > > > > > > > > > > > > > Hi everyone, > > > > > > > > > > > > > > > > > > I want to start a FLIP about supporting windowing > > > > > table-valued > > > > > > > > > functions > > > > > > > > > (TVF). > > > > > > > > > The main purpose of this FLIP is to improve the near > > > > > real-time > > > > > > > > (NRT) > > > > > > > > > experience of Flink. > > > > > > > > > > > > > > > > > > FLIP-145: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-145%3A+Support+SQL+windowing+table-valued+function > > > > > > > > > > > > > > > > > > We want to introduce TUMBLE, HOP, CUMULATE windowing > > > > TVFs, > > > > > the > > > > > > > > > CUMULATE is > > > > > > > > > a new kind of window. > > > > > > > > > With the windowing TVFs, we can support richer > > > > operations on > > > > > > > > windows, > > > > > > > > > including window join, window TopN and so on. > > > > > > > > > This makes things simple: we only need to assign > > > > windows at > > > > > the > > > > > > > > > beginning > > > > > > > > > of the query, and then apply operations after that > > > like > > > > > > > > traditional > > > > > > > > > batch > > > > > > > > > SQL. > > > > > > > > > We hope it can help to reduce the learning curve of > > > > windows, > > > > > > > > improve > > > > > > > > > NRT > > > > > > > > > for Flink, and attract more batch users. > > > > > > > > > > > > > > > > > > A simple code snippet for 10 minutes tumbling window > > > > > aggregate: > > > > > > > > > > > > > > > > > > SELECT window_start, window_end, SUM(price) > > > > > > > > > FROM TABLE( > > > > > > > > > TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL > > > > '10' > > > > > > > > MINUTES)) > > > > > > > > > GROUP BY window_start, window_end; > > > > > > > > > > > > > > > > > > I'm looking forward to your feedback. > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > Jark > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > Best, > > > > Benchao Li > > > > > > > > > > > > > -- > > > > > > Best, > > > Benchao Li > > > > > > > > > -- > > > > > > Best, > > > Benchao Li > > > > > > > -- > > Best, > Benchao Li |
In reply to this post by Jark Wu-2
Hi, Jark,
I've got some different opinions there, I think it's a very common use case to use window operators in combination with streaming operators(even those time operators). (e.g. for some tables, users only care data within a period, but for other tables, they may want the whole historical data). The pipeline may looks like this: window join -> dimension table join -> stream aggregate -> stream sort Just as what you said, the key clause can be used to distinguish whether a operator should be translated to a window operator or a streaming operator. Also, as I've mentioned before, 1) for time operator after window aggregation, the auxiliary function which is used to access time attribute column can be actually replaced with (window_end -1). Actually, we only just need to make the results of the upstream contains a time column whose range is within (window_start, window_end), and thus the downstream time operators can work on it (driving by the original watermark in the source). 2) for time operator after other window operators, the downstream time operators can access the time column directly from it's input. One more thoughts there, maybe the window TVFs can re-assign timestamps and watermarks, so that in some case when the watermark can not be retrieved from source directly(may needs some conversions), the watermark can still be assigned dynamically in the SQL(use the time column as the watermark column) and thus make it work. I think this can save much time to revise the event time column in some cases(this is a real demand in our production environment). I strongly suggest that we should support the combination usage of window operators and streaming operators. And I think we can achieve this with little work. Best, Pengcheng Jark Wu <[hidden email]> 于2020年10月10日周六 下午1:45写道: > Hi Benchao, > > That's a good question. > > IMO, the new windowed operators and the current time operators are two > different sets of functions, > just like time operators and non-time operators are two different sets of > functions. > I think it's fine if we don't support integrating them, just like time > operators can't be applied on non-windowed aggregate. > If users want to use time operators in the whole pipeline, then he/she can > use the grouped window aggregates instead of the window TVFs. > > The key idea of window TVF is that all the operators in the pipeline are > based on the **windows**. > In terms of syntax, if the key clause (e.g. group by, partitioned by, join > on, order by) contains window_start and window_end, > it can be translated into windowed operators. > Thus, we will have windowed CEP, windowed sort, windowed over aggregate in > the future to make it possible to build a windowed pipeline. > > But I think we can elaborate the integration more in the future if users > need it. Actually, I don't fully understand the scenario of integrating > window TVF and time operators at this point. > For example, interval join an input stream and a window join result. I > don't see why it can't be expressed by nested window join and why users > have to use interval join here. > Maybe we can wait for more inputs from users when the window TVF is > released and we can elaborate it again. > > Best, > Jark > > On Sat, 10 Oct 2020 at 12:01, 刘 芃成 <[hidden email]> wrote: > > > Hi, Benchao, > > I think I got your point, actually, in current implementation for > > group window aggregation, the value of time attributes(e.g. > > TUMBLE_ROWTIME/TUMBLE_PROCTIME) is calculated as (window_end – 1), so I > > think we can just use it directly if you need this. But I think this time > > attributes is mainly suggested to use in case of cascaded window > operations. > > Regarding the example you provided, I think the semantics of the SQL in > > your example which doing interval join(e.g. with TUMBLE_ROWTIME) after > > window aggregation is not clear in the current implementation, and I > think > > that’s a strong reason why we need the new TVFs syntax. > > With the new syntax, users should understand which time column to > > use and how to generate it when doing interval join and etc. > > > > Best, > > Pengcheng > > > > 发件人: Benchao Li <[hidden email]> > > 日期: 2020年10月10日 星期六 上午11:02 > > 收件人: pengcheng Liu <[hidden email]> > > 抄送: dev <[hidden email]> > > 主题: Re: [DISCUSS] FLIP-145: Support SQL windowing table-valued function > > > > Hi pengcheng, > > > > Thanks for your response. > > I knew that the original time attribute column will be retained after the > > TVF, > > what I'm questioning is how do we get the time attribute column after > > Aggregation. > > Your answer did not remove my doubts about this. > > > > It's ok if we did not plan to integrate new TVF aggregate with old "time > > attribute scenarios" > > listed in my previous email in this FLIP. However it's good to elaborate > > leave it to the future plan. > > > > pengcheng Liu <[hidden email]<mailto: > > [hidden email]>> 于2020年10月10日周六 上午10:45写道: > > Hi,Benchao, > > In TVFs, the time attributes is just passed through from parent rels, > > and the TVFs just add two > > additional window attributes(i.e. window_start & window_end). Also, I > > think the time columns can be not only a time attribute > > with type of `TimeIndicatorType` but also a regular column with type > > of `Timestamp`. > > > > For cascaded window operations, we can use window_start/window_end of > > the previous window result directly to > > indicate operating on the same window, or use new DESCRIPTOR column > > to assign new windows, in case of the change of > > the time column(e.g. in some case, the original timestamp is > > inaccurate and need some conversion to be used). > > > > You can check the definition or signature of these TVFs in the FLIP. > > e.g. > > SELECT * FROM TABLE( > > TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES)) > > In the example, the `bidtime` is the time attribute column, which is > > the first operand of the DESCRIPTOR function. > > > > +1 start voting. > > > > Benchao Li <[hidden email]<mailto:[hidden email]>> > > 于2020年10月10日周六 上午10:08写道: > > Hi Jark, > > > > 2 & 3 sounds good to me. > > > > Regarding time attribute, > > I still have some questions, I knew it's easy to support cascaded window > > aggregate using new TVFs. > > However there are some other places where need time attribute: > > - CEP > > - interval join > > - order by > > - over window > > If there is no time attribute column, how do we integrate these old > > features with the new TVFs. > > E.g. > > StreamA -> new window aggregate -> interval join -> Sink > > / > > StreamB ----------------------------------- > > > > > > Jark Wu <[hidden email]<mailto:[hidden email]>> 于2020年10月9日周五 > > 下午11:51写道: > > Hi Benchao, > > > > 1) time attribute > > Yes. We don't need time attribute auxiliary function. Because the new > > window operations are all based on the > > window_start and window_end columns instead of on the time attributes. > So > > we don't need to propagate time attributes. > > Cascaded window aggregate can be expressed by simply GROUP BY the > > window_start and window_end of the previous window result. > > I have added a cascaded window aggregate example in the Tumbling Window > > section in the FLIP. > > If you want to define proctime window aggregate, the time column in TVF > > should be a proctime attribute field (or PROCTIME() function). > > > > 2) batch support > > Yes. The proposed syntax/API are unified for batch and streaming. Batch > > support is in the plan, but may not have enough time to catch up 1.12. > > > > 3) support `grouping sets` > > This is not included in the FLIP, but I think it's great if we can > support > > `grouping sets`. > > The existing window impl doesn't support this because we convert the > > LogicalAggregate into WindowAggregate in the beginning, > > the expand grouping sets rule can't be applied in this situation. > > Fortunately, with the new window impl, the conversion to WindowAggregate > > will happen at the end, so I think the expand rule can be > > applied and support this feature naturally. > > Therefore, IMO, we don't need to include this feature in this FLIP to > avoid > > the FLIP being too large. > > This can be a follow-up issue (maybe just add tests and docs) after the > > FLIP. > > > > Best, > > Jark > > > > > > On Fri, 9 Oct 2020 at 19:09, 刘 芃成 <[hidden email]<mailto: > > [hidden email]>> wrote: > > > > > Hi,Benchao, > > > Welcome to join the discussion, yes, this new syntax can make > SQL > > > more clear and simpler. > > > For your first question, the `window_start` and `window_end` > > > columns will be added automatically, > > > so we don't need to use auxiliary group functions to infer or > > > access the window properties. > > > > > > For the `grouping sets` on TVFs, I think it's interesting if we > > > can support it, as we already supported `grouping sets` > > > on streaming aggregates in blink planner. But I'm not sure if > it > > > will be included into this FLIP. > > > > > > cc @Jark Wu > > > > > > Best, > > > Pengcheng > > > > > > > > > 在 2020/10/9 下午5:25,“Benchao Li”<[hidden email]<mailto: > > [hidden email]>> 写入: > > > > > > Thanks Jark for bringing this discussion, I like this FLIP very > much. > > > > > > Especially the cumulate window, it's much like the current TUMBLE > > > window + > > > Fast Emit (which is an undocumented experimental feature), however, > > > it's > > > more powerful. > > > > > > And This will make the SQL semantic more standard, especially for > the > > > HOPPING window. > > > > > > Regarding time attribute, > > > It seems that we don't need a specific function to infer the time > > > attribute > > > like > > > `TUMBLE_ROWTIME` / `TUMBLE_PROCTIME`. Then are `window_start` and > > > `window_end` > > > column a time attribute column automatically? > > > - If not, what will be the time attribute of the result relation of > > > these > > > TVFs? > > > Especially after the window aggregation. > > > - If yes, then how do we handle proctime? > > > > > > Regarding batch operators, > > > It's great to hear that we can reuse the batch operators in > > continuous > > > batch mode > > > as you mentioned in the FLIP. > > > Current window aggregate could also be used in batch mode with > > > rowtime. Do > > > you plan > > > to support these TVFs for batch mode in this FLIP? Hence the > > Table/SQL > > > is a > > > unified > > > API, it's great if we can keep the features complete both in > > streaming > > > and > > > batch mode. > > > > > > There is one more question, I don't know whether it should be > > > considered in > > > this FLIP. > > > Does the new window support `grouping sets`? (It's not supported in > > old > > > window impl). > > > > > > Jark Wu <[hidden email]<mailto:[hidden email]>> 于2020年10月9日周五 > > 下午4:14写道: > > > > > > > Hi all, > > > > > > > > I know we have a lot of discussion and development on going right > > > now but > > > > it would be great if we can get FLIP-145 into a votable state. > > > > If there are no objections, I would like to start voting in the > > next > > > days. > > > > > > > > Best, > > > > Jark > > > > > > > > On Thu, 1 Oct 2020 at 14:29, Jark Wu <[hidden email]<mailto: > > [hidden email]>> wrote: > > > > > > > > > Hi everyone, > > > > > > > > > > I have added a section for Performance Optimization to describe > > > how to > > > > > improve the performance in the short-term and long-term > > > > > and sketch the future performance potential under the new > window > > > API. > > > > > Introducing the window API is just the first step, we will > > > > > continuously improve the performance to make it powerful and > > > useful. > > > > > > > > > > Best, > > > > > Jark > > > > > > > > > > On Thu, 1 Oct 2020 at 14:28, Jark Wu <[hidden email]<mailto: > > [hidden email]>> wrote: > > > > > > > > > >> Hi Pengcheng, > > > > >> > > > > >> Yes, the window TVF is part of the FLIP. Welcome to contribute > > > and join > > > > >> the discussion. > > > > >> Regarding the SESSION window aggregation, users can use the > > > existing > > > > >> grouped session window function. > > > > >> > > > > >> Best, > > > > >> Jark > > > > >> > > > > >> On Sun, 27 Sep 2020 at 21:24, liupengcheng < > > > [hidden email]<mailto:[hidden email]> > > > > > > > > > >> wrote: > > > > >> > > > > >>> Hi Jark, > > > > >>> Thanks for reply, yes, I think it's a good feature, > it > > > can > > > > >>> improve the NRT scenarios > > > > >>> as you mentioned in the FLIP. Also, I think it can > > > improve the > > > > >>> streaming SQL greatly, > > > > >>> it can support richer window operations in flink SQL > > and > > > bring > > > > >>> great convenience to users. > > > > >>> (we are now only supported group window in flink). > > > > >>> > > > > >>> Regarding the SESSION window, I think it's especially > > > useful > > > > for > > > > >>> user behavior analysis(e.g. > > > > >>> counting user visits on a news website or social > > > platform), but > > > > >>> I agree that we can keep it > > > > >>> out of the FLIP now to catch up 1.12. > > > > >>> > > > > >>> Recently, I've done some work on the stream planner > > with > > > the > > > > >>> TVFs, and I'm willing to contribute > > > > >>> to this part. Is it in the plan of this FLIP? > > > > >>> > > > > >>> Best, > > > > >>> PengchengLiu > > > > >>> > > > > >>> > > > > >>> 在 2020/9/26 下午11:09,“Jark Wu”<[hidden email]<mailto: > > [hidden email]>> 写入: > > > > >>> > > > > >>> Hi pengcheng, > > > > >>> > > > > >>> That's great to see you also have the need of window > join. > > > > >>> You are right, the windowing TVF is a powerful feature > > which > > > can > > > > >>> support > > > > >>> more operations in the future. > > > > >>> I think it as of the date time "partition" selection in > > > batch SQL > > > > >>> jobs, > > > > >>> with this new syntax, I think it is possible > > > > >>> to migrate traditional batch SQL jobs to Flink SQL by > > > changing a > > > > >>> few lines. > > > > >>> > > > > >>> Regarding the SESSION window, this is on purpose to keep > it > > > out of > > > > >>> the > > > > >>> FLIP, because we want to keep the > > > > >>> FLIP small to catch up 1.12 and SESSION TVF is rarely > > useful > > > (e.g. > > > > >>> session > > > > >>> window join?). > > > > >>> > > > > >>> Best, > > > > >>> Jark > > > > >>> > > > > >>> On Fri, 25 Sep 2020 at 22:59, liupengcheng < > > > > >>> [hidden email]<mailto: > [hidden email] > > >> > > > > >>> wrote: > > > > >>> > > > > >>> > Hi, Jark, > > > > >>> > I'm very interested in this feature, and I'm > also > > > working > > > > >>> on this > > > > >>> > recently. > > > > >>> > I just have a glance at the FLIP, it's good, > but > > I > > > found > > > > >>> that > > > > >>> > there is no plan to add SESSION windows. > > > > >>> > Also, I think there can be more things we can > do > > > based on > > > > >>> this new > > > > >>> > syntax. For example, > > > > >>> > - window sort support > > > > >>> > - window union/intersect/minus support > > > > >>> > - Improve dimension table join > > > > >>> > We can have more deep discussion on this new > > > feature > > > > later > > > > >>> . > > > > >>> > I've also opened an jira that is related to > this > > > feature > > > > >>> recently: > > > > >>> > https://issues.apache.org/jira/browse/FLINK-18830 > > > > >>> > > > > > >>> > Best! > > > > >>> > PengchengLiu > > > > >>> > > > > > >>> > 在 2020/9/25 下午10:30,“Jark Wu”<[hidden email] > <mailto: > > [hidden email]>> 写入: > > > > >>> > > > > > >>> > Hi everyone, > > > > >>> > > > > > >>> > I want to start a FLIP about supporting windowing > > > > table-valued > > > > >>> > functions > > > > >>> > (TVF). > > > > >>> > The main purpose of this FLIP is to improve the > near > > > > real-time > > > > >>> (NRT) > > > > >>> > experience of Flink. > > > > >>> > > > > > >>> > FLIP-145: > > > > >>> > > > > > >>> > > > > > >>> > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-145%3A+Support+SQL+windowing+table-valued+function > > > > >>> > > > > > >>> > We want to introduce TUMBLE, HOP, CUMULATE > windowing > > > TVFs, > > > > the > > > > >>> > CUMULATE is > > > > >>> > a new kind of window. > > > > >>> > With the windowing TVFs, we can support richer > > > operations on > > > > >>> windows, > > > > >>> > including window join, window TopN and so on. > > > > >>> > This makes things simple: we only need to assign > > > windows at > > > > the > > > > >>> > beginning > > > > >>> > of the query, and then apply operations after that > > like > > > > >>> traditional > > > > >>> > batch > > > > >>> > SQL. > > > > >>> > We hope it can help to reduce the learning curve of > > > windows, > > > > >>> improve > > > > >>> > NRT > > > > >>> > for Flink, and attract more batch users. > > > > >>> > > > > > >>> > A simple code snippet for 10 minutes tumbling > window > > > > aggregate: > > > > >>> > > > > > >>> > SELECT window_start, window_end, SUM(price) > > > > >>> > FROM TABLE( > > > > >>> > TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL > > > '10' > > > > >>> MINUTES)) > > > > >>> > GROUP BY window_start, window_end; > > > > >>> > > > > > >>> > I'm looking forward to your feedback. > > > > >>> > > > > > >>> > Best, > > > > >>> > Jark > > > > >>> > > > > > >>> > > > > > >>> > > > > > >>> > > > > >>> > > > > >>> > > > > > > > > > > > > > -- > > > > > > Best, > > > Benchao Li > > > > > > > > > -- > > > > Best, > > Benchao Li > > > > > > -- > > > > Best, > > Benchao Li > > > |
Hi Danny,
Thanks for the hint about named params syntax, I added examples with named params in the FLIP. Best, Jark On Sat, 10 Oct 2020 at 15:03, Pengcheng Liu <[hidden email]> wrote: > Hi, Jark, > > I've got some different opinions there, I think it's a very common use > case to use > window operators in combination with streaming operators(even those > time operators). > (e.g. for some tables, users only care data within a period, but for > other tables, they may > want the whole historical data). > The pipeline may looks like this: > window join -> dimension table join -> stream aggregate -> stream sort > > Just as what you said, the key clause can be used to distinguish > whether a operator should > be translated to a window operator or a streaming operator. > > Also, as I've mentioned before, 1) for time operator after window > aggregation, the auxiliary function > which is used to access time attribute column can be actually replaced > with (window_end -1). > Actually, we only just need to make the results of the upstream > contains a time column whose > range is within (window_start, window_end), and thus the downstream > time operators can work on it > (driving by the original watermark in the source). 2) for time operator > after other window operators, > the downstream time operators can access the time column directly from > it's input. > > One more thoughts there, maybe the window TVFs can re-assign timestamps > and watermarks, so > that in some case when the watermark can not be retrieved from source > directly(may needs some > conversions), the watermark can still be assigned dynamically in the > SQL(use the time column as > the watermark column) and thus make it work. I think this can save much > time to revise the event > time column in some cases(this is a real demand in our production > environment). > > I strongly suggest that we should support the combination usage of > window operators and > streaming operators. And I think we can achieve this with little work. > > Best, > Pengcheng > > > Jark Wu <[hidden email]> 于2020年10月10日周六 下午1:45写道: > >> Hi Benchao, >> >> That's a good question. >> >> IMO, the new windowed operators and the current time operators are two >> different sets of functions, >> just like time operators and non-time operators are two different sets of >> functions. >> I think it's fine if we don't support integrating them, just like time >> operators can't be applied on non-windowed aggregate. >> If users want to use time operators in the whole pipeline, then he/she can >> use the grouped window aggregates instead of the window TVFs. >> >> The key idea of window TVF is that all the operators in the pipeline are >> based on the **windows**. >> In terms of syntax, if the key clause (e.g. group by, partitioned by, join >> on, order by) contains window_start and window_end, >> it can be translated into windowed operators. >> Thus, we will have windowed CEP, windowed sort, windowed over aggregate in >> the future to make it possible to build a windowed pipeline. >> >> But I think we can elaborate the integration more in the future if users >> need it. Actually, I don't fully understand the scenario of integrating >> window TVF and time operators at this point. >> For example, interval join an input stream and a window join result. I >> don't see why it can't be expressed by nested window join and why users >> have to use interval join here. >> Maybe we can wait for more inputs from users when the window TVF is >> released and we can elaborate it again. >> >> Best, >> Jark >> >> On Sat, 10 Oct 2020 at 12:01, 刘 芃成 <[hidden email]> wrote: >> >> > Hi, Benchao, >> > I think I got your point, actually, in current implementation for >> > group window aggregation, the value of time attributes(e.g. >> > TUMBLE_ROWTIME/TUMBLE_PROCTIME) is calculated as (window_end – 1), so I >> > think we can just use it directly if you need this. But I think this >> time >> > attributes is mainly suggested to use in case of cascaded window >> operations. >> > Regarding the example you provided, I think the semantics of the SQL in >> > your example which doing interval join(e.g. with TUMBLE_ROWTIME) after >> > window aggregation is not clear in the current implementation, and I >> think >> > that’s a strong reason why we need the new TVFs syntax. >> > With the new syntax, users should understand which time column to >> > use and how to generate it when doing interval join and etc. >> > >> > Best, >> > Pengcheng >> > >> > 发件人: Benchao Li <[hidden email]> >> > 日期: 2020年10月10日 星期六 上午11:02 >> > 收件人: pengcheng Liu <[hidden email]> >> > 抄送: dev <[hidden email]> >> > 主题: Re: [DISCUSS] FLIP-145: Support SQL windowing table-valued function >> > >> > Hi pengcheng, >> > >> > Thanks for your response. >> > I knew that the original time attribute column will be retained after >> the >> > TVF, >> > what I'm questioning is how do we get the time attribute column after >> > Aggregation. >> > Your answer did not remove my doubts about this. >> > >> > It's ok if we did not plan to integrate new TVF aggregate with old "time >> > attribute scenarios" >> > listed in my previous email in this FLIP. However it's good to elaborate >> > leave it to the future plan. >> > >> > pengcheng Liu <[hidden email]<mailto: >> > [hidden email]>> 于2020年10月10日周六 上午10:45写道: >> > Hi,Benchao, >> > In TVFs, the time attributes is just passed through from parent >> rels, >> > and the TVFs just add two >> > additional window attributes(i.e. window_start & window_end). Also, >> I >> > think the time columns can be not only a time attribute >> > with type of `TimeIndicatorType` but also a regular column with type >> > of `Timestamp`. >> > >> > For cascaded window operations, we can use window_start/window_end >> of >> > the previous window result directly to >> > indicate operating on the same window, or use new DESCRIPTOR column >> > to assign new windows, in case of the change of >> > the time column(e.g. in some case, the original timestamp is >> > inaccurate and need some conversion to be used). >> > >> > You can check the definition or signature of these TVFs in the FLIP. >> > e.g. >> > SELECT * FROM TABLE( >> > TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES)) >> > In the example, the `bidtime` is the time attribute column, which is >> > the first operand of the DESCRIPTOR function. >> > >> > +1 start voting. >> > >> > Benchao Li <[hidden email]<mailto:[hidden email]>> >> > 于2020年10月10日周六 上午10:08写道: >> > Hi Jark, >> > >> > 2 & 3 sounds good to me. >> > >> > Regarding time attribute, >> > I still have some questions, I knew it's easy to support cascaded window >> > aggregate using new TVFs. >> > However there are some other places where need time attribute: >> > - CEP >> > - interval join >> > - order by >> > - over window >> > If there is no time attribute column, how do we integrate these old >> > features with the new TVFs. >> > E.g. >> > StreamA -> new window aggregate -> interval join -> Sink >> > / >> > StreamB ----------------------------------- >> > >> > >> > Jark Wu <[hidden email]<mailto:[hidden email]>> 于2020年10月9日周五 >> > 下午11:51写道: >> > Hi Benchao, >> > >> > 1) time attribute >> > Yes. We don't need time attribute auxiliary function. Because the new >> > window operations are all based on the >> > window_start and window_end columns instead of on the time attributes. >> So >> > we don't need to propagate time attributes. >> > Cascaded window aggregate can be expressed by simply GROUP BY the >> > window_start and window_end of the previous window result. >> > I have added a cascaded window aggregate example in the Tumbling Window >> > section in the FLIP. >> > If you want to define proctime window aggregate, the time column in TVF >> > should be a proctime attribute field (or PROCTIME() function). >> > >> > 2) batch support >> > Yes. The proposed syntax/API are unified for batch and streaming. Batch >> > support is in the plan, but may not have enough time to catch up 1.12. >> > >> > 3) support `grouping sets` >> > This is not included in the FLIP, but I think it's great if we can >> support >> > `grouping sets`. >> > The existing window impl doesn't support this because we convert the >> > LogicalAggregate into WindowAggregate in the beginning, >> > the expand grouping sets rule can't be applied in this situation. >> > Fortunately, with the new window impl, the conversion to WindowAggregate >> > will happen at the end, so I think the expand rule can be >> > applied and support this feature naturally. >> > Therefore, IMO, we don't need to include this feature in this FLIP to >> avoid >> > the FLIP being too large. >> > This can be a follow-up issue (maybe just add tests and docs) after the >> > FLIP. >> > >> > Best, >> > Jark >> > >> > >> > On Fri, 9 Oct 2020 at 19:09, 刘 芃成 <[hidden email]<mailto: >> > [hidden email]>> wrote: >> > >> > > Hi,Benchao, >> > > Welcome to join the discussion, yes, this new syntax can make >> SQL >> > > more clear and simpler. >> > > For your first question, the `window_start` and `window_end` >> > > columns will be added automatically, >> > > so we don't need to use auxiliary group functions to infer or >> > > access the window properties. >> > > >> > > For the `grouping sets` on TVFs, I think it's interesting if >> we >> > > can support it, as we already supported `grouping sets` >> > > on streaming aggregates in blink planner. But I'm not sure if >> it >> > > will be included into this FLIP. >> > > >> > > cc @Jark Wu >> > > >> > > Best, >> > > Pengcheng >> > > >> > > >> > > 在 2020/10/9 下午5:25,“Benchao Li”<[hidden email]<mailto: >> > [hidden email]>> 写入: >> > > >> > > Thanks Jark for bringing this discussion, I like this FLIP very >> much. >> > > >> > > Especially the cumulate window, it's much like the current TUMBLE >> > > window + >> > > Fast Emit (which is an undocumented experimental feature), >> however, >> > > it's >> > > more powerful. >> > > >> > > And This will make the SQL semantic more standard, especially for >> the >> > > HOPPING window. >> > > >> > > Regarding time attribute, >> > > It seems that we don't need a specific function to infer the time >> > > attribute >> > > like >> > > `TUMBLE_ROWTIME` / `TUMBLE_PROCTIME`. Then are `window_start` and >> > > `window_end` >> > > column a time attribute column automatically? >> > > - If not, what will be the time attribute of the result relation >> of >> > > these >> > > TVFs? >> > > Especially after the window aggregation. >> > > - If yes, then how do we handle proctime? >> > > >> > > Regarding batch operators, >> > > It's great to hear that we can reuse the batch operators in >> > continuous >> > > batch mode >> > > as you mentioned in the FLIP. >> > > Current window aggregate could also be used in batch mode with >> > > rowtime. Do >> > > you plan >> > > to support these TVFs for batch mode in this FLIP? Hence the >> > Table/SQL >> > > is a >> > > unified >> > > API, it's great if we can keep the features complete both in >> > streaming >> > > and >> > > batch mode. >> > > >> > > There is one more question, I don't know whether it should be >> > > considered in >> > > this FLIP. >> > > Does the new window support `grouping sets`? (It's not supported >> in >> > old >> > > window impl). >> > > >> > > Jark Wu <[hidden email]<mailto:[hidden email]>> 于2020年10月9日周五 >> > 下午4:14写道: >> > > >> > > > Hi all, >> > > > >> > > > I know we have a lot of discussion and development on going >> right >> > > now but >> > > > it would be great if we can get FLIP-145 into a votable state. >> > > > If there are no objections, I would like to start voting in the >> > next >> > > days. >> > > > >> > > > Best, >> > > > Jark >> > > > >> > > > On Thu, 1 Oct 2020 at 14:29, Jark Wu <[hidden email]<mailto: >> > [hidden email]>> wrote: >> > > > >> > > > > Hi everyone, >> > > > > >> > > > > I have added a section for Performance Optimization to >> describe >> > > how to >> > > > > improve the performance in the short-term and long-term >> > > > > and sketch the future performance potential under the new >> window >> > > API. >> > > > > Introducing the window API is just the first step, we will >> > > > > continuously improve the performance to make it powerful and >> > > useful. >> > > > > >> > > > > Best, >> > > > > Jark >> > > > > >> > > > > On Thu, 1 Oct 2020 at 14:28, Jark Wu <[hidden email] >> <mailto: >> > [hidden email]>> wrote: >> > > > > >> > > > >> Hi Pengcheng, >> > > > >> >> > > > >> Yes, the window TVF is part of the FLIP. Welcome to >> contribute >> > > and join >> > > > >> the discussion. >> > > > >> Regarding the SESSION window aggregation, users can use the >> > > existing >> > > > >> grouped session window function. >> > > > >> >> > > > >> Best, >> > > > >> Jark >> > > > >> >> > > > >> On Sun, 27 Sep 2020 at 21:24, liupengcheng < >> > > [hidden email]<mailto:[hidden email]> >> > > > > >> > > > >> wrote: >> > > > >> >> > > > >>> Hi Jark, >> > > > >>> Thanks for reply, yes, I think it's a good feature, >> it >> > > can >> > > > >>> improve the NRT scenarios >> > > > >>> as you mentioned in the FLIP. Also, I think it can >> > > improve the >> > > > >>> streaming SQL greatly, >> > > > >>> it can support richer window operations in flink SQL >> > and >> > > bring >> > > > >>> great convenience to users. >> > > > >>> (we are now only supported group window in flink). >> > > > >>> >> > > > >>> Regarding the SESSION window, I think it's >> especially >> > > useful >> > > > for >> > > > >>> user behavior analysis(e.g. >> > > > >>> counting user visits on a news website or social >> > > platform), but >> > > > >>> I agree that we can keep it >> > > > >>> out of the FLIP now to catch up 1.12. >> > > > >>> >> > > > >>> Recently, I've done some work on the stream planner >> > with >> > > the >> > > > >>> TVFs, and I'm willing to contribute >> > > > >>> to this part. Is it in the plan of this FLIP? >> > > > >>> >> > > > >>> Best, >> > > > >>> PengchengLiu >> > > > >>> >> > > > >>> >> > > > >>> 在 2020/9/26 下午11:09,“Jark Wu”<[hidden email]<mailto: >> > [hidden email]>> 写入: >> > > > >>> >> > > > >>> Hi pengcheng, >> > > > >>> >> > > > >>> That's great to see you also have the need of window >> join. >> > > > >>> You are right, the windowing TVF is a powerful feature >> > which >> > > can >> > > > >>> support >> > > > >>> more operations in the future. >> > > > >>> I think it as of the date time "partition" selection in >> > > batch SQL >> > > > >>> jobs, >> > > > >>> with this new syntax, I think it is possible >> > > > >>> to migrate traditional batch SQL jobs to Flink SQL by >> > > changing a >> > > > >>> few lines. >> > > > >>> >> > > > >>> Regarding the SESSION window, this is on purpose to >> keep it >> > > out of >> > > > >>> the >> > > > >>> FLIP, because we want to keep the >> > > > >>> FLIP small to catch up 1.12 and SESSION TVF is rarely >> > useful >> > > (e.g. >> > > > >>> session >> > > > >>> window join?). >> > > > >>> >> > > > >>> Best, >> > > > >>> Jark >> > > > >>> >> > > > >>> On Fri, 25 Sep 2020 at 22:59, liupengcheng < >> > > > >>> [hidden email]<mailto: >> [hidden email] >> > >> >> > > > >>> wrote: >> > > > >>> >> > > > >>> > Hi, Jark, >> > > > >>> > I'm very interested in this feature, and I'm >> also >> > > working >> > > > >>> on this >> > > > >>> > recently. >> > > > >>> > I just have a glance at the FLIP, it's good, >> but >> > I >> > > found >> > > > >>> that >> > > > >>> > there is no plan to add SESSION windows. >> > > > >>> > Also, I think there can be more things we can >> do >> > > based on >> > > > >>> this new >> > > > >>> > syntax. For example, >> > > > >>> > - window sort support >> > > > >>> > - window union/intersect/minus support >> > > > >>> > - Improve dimension table join >> > > > >>> > We can have more deep discussion on this new >> > > feature >> > > > later >> > > > >>> . >> > > > >>> > I've also opened an jira that is related to >> this >> > > feature >> > > > >>> recently: >> > > > >>> > https://issues.apache.org/jira/browse/FLINK-18830 >> > > > >>> > >> > > > >>> > Best! >> > > > >>> > PengchengLiu >> > > > >>> > >> > > > >>> > 在 2020/9/25 下午10:30,“Jark Wu”<[hidden email] >> <mailto: >> > [hidden email]>> 写入: >> > > > >>> > >> > > > >>> > Hi everyone, >> > > > >>> > >> > > > >>> > I want to start a FLIP about supporting windowing >> > > > table-valued >> > > > >>> > functions >> > > > >>> > (TVF). >> > > > >>> > The main purpose of this FLIP is to improve the >> near >> > > > real-time >> > > > >>> (NRT) >> > > > >>> > experience of Flink. >> > > > >>> > >> > > > >>> > FLIP-145: >> > > > >>> > >> > > > >>> > >> > > > >>> >> > > > >> > > >> > >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-145%3A+Support+SQL+windowing+table-valued+function >> > > > >>> > >> > > > >>> > We want to introduce TUMBLE, HOP, CUMULATE >> windowing >> > > TVFs, >> > > > the >> > > > >>> > CUMULATE is >> > > > >>> > a new kind of window. >> > > > >>> > With the windowing TVFs, we can support richer >> > > operations on >> > > > >>> windows, >> > > > >>> > including window join, window TopN and so on. >> > > > >>> > This makes things simple: we only need to assign >> > > windows at >> > > > the >> > > > >>> > beginning >> > > > >>> > of the query, and then apply operations after that >> > like >> > > > >>> traditional >> > > > >>> > batch >> > > > >>> > SQL. >> > > > >>> > We hope it can help to reduce the learning curve >> of >> > > windows, >> > > > >>> improve >> > > > >>> > NRT >> > > > >>> > for Flink, and attract more batch users. >> > > > >>> > >> > > > >>> > A simple code snippet for 10 minutes tumbling >> window >> > > > aggregate: >> > > > >>> > >> > > > >>> > SELECT window_start, window_end, SUM(price) >> > > > >>> > FROM TABLE( >> > > > >>> > TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), >> INTERVAL >> > > '10' >> > > > >>> MINUTES)) >> > > > >>> > GROUP BY window_start, window_end; >> > > > >>> > >> > > > >>> > I'm looking forward to your feedback. >> > > > >>> > >> > > > >>> > Best, >> > > > >>> > Jark >> > > > >>> > >> > > > >>> > >> > > > >>> > >> > > > >>> >> > > > >>> >> > > > >>> >> > > > >> > > >> > > >> > > -- >> > > >> > > Best, >> > > Benchao Li >> > > >> > >> > >> > -- >> > >> > Best, >> > Benchao Li >> > >> > >> > -- >> > >> > Best, >> > Benchao Li >> > >> > |
Hi Pengcheng,
IIUC, the "stream operators" you mean is the non-time operators or called regular operators, such as regular join, regular aggregate. But you may misunderstand me, only the time operators can't be applied after the new window operators, because of missing time attributes. The regular operators can still be applied after the new window operators. Regarding using window TVFs to re-assign event-time and watermarks, I'm not sure about this. Because assigning watermark requires to define the watermark strategy, however, the window TVF doesn't provide such ability. Polymorphic table functions are table functions which just append additional columns and convert N rows into M rows, it can't touch meta information. Best, Jark On Sat, 10 Oct 2020 at 15:41, Jark Wu <[hidden email]> wrote: > Hi Danny, > > Thanks for the hint about named params syntax, I added examples with named > params in the FLIP. > > Best, > Jark > > > On Sat, 10 Oct 2020 at 15:03, Pengcheng Liu <[hidden email]> > wrote: > >> Hi, Jark, >> >> I've got some different opinions there, I think it's a very common use >> case to use >> window operators in combination with streaming operators(even those >> time operators). >> (e.g. for some tables, users only care data within a period, but for >> other tables, they may >> want the whole historical data). >> The pipeline may looks like this: >> window join -> dimension table join -> stream aggregate -> stream sort >> >> Just as what you said, the key clause can be used to distinguish >> whether a operator should >> be translated to a window operator or a streaming operator. >> >> Also, as I've mentioned before, 1) for time operator after window >> aggregation, the auxiliary function >> which is used to access time attribute column can be actually replaced >> with (window_end -1). >> Actually, we only just need to make the results of the upstream >> contains a time column whose >> range is within (window_start, window_end), and thus the downstream >> time operators can work on it >> (driving by the original watermark in the source). 2) for time >> operator after other window operators, >> the downstream time operators can access the time column directly from >> it's input. >> >> One more thoughts there, maybe the window TVFs can re-assign >> timestamps and watermarks, so >> that in some case when the watermark can not be retrieved from source >> directly(may needs some >> conversions), the watermark can still be assigned dynamically in the >> SQL(use the time column as >> the watermark column) and thus make it work. I think this can save >> much time to revise the event >> time column in some cases(this is a real demand in our production >> environment). >> >> I strongly suggest that we should support the combination usage of >> window operators and >> streaming operators. And I think we can achieve this with little work. >> >> Best, >> Pengcheng >> >> >> Jark Wu <[hidden email]> 于2020年10月10日周六 下午1:45写道: >> >>> Hi Benchao, >>> >>> That's a good question. >>> >>> IMO, the new windowed operators and the current time operators are two >>> different sets of functions, >>> just like time operators and non-time operators are two different sets of >>> functions. >>> I think it's fine if we don't support integrating them, just like time >>> operators can't be applied on non-windowed aggregate. >>> If users want to use time operators in the whole pipeline, then he/she >>> can >>> use the grouped window aggregates instead of the window TVFs. >>> >>> The key idea of window TVF is that all the operators in the pipeline are >>> based on the **windows**. >>> In terms of syntax, if the key clause (e.g. group by, partitioned by, >>> join >>> on, order by) contains window_start and window_end, >>> it can be translated into windowed operators. >>> Thus, we will have windowed CEP, windowed sort, windowed over aggregate >>> in >>> the future to make it possible to build a windowed pipeline. >>> >>> But I think we can elaborate the integration more in the future if users >>> need it. Actually, I don't fully understand the scenario of integrating >>> window TVF and time operators at this point. >>> For example, interval join an input stream and a window join result. I >>> don't see why it can't be expressed by nested window join and why users >>> have to use interval join here. >>> Maybe we can wait for more inputs from users when the window TVF is >>> released and we can elaborate it again. >>> >>> Best, >>> Jark >>> >>> On Sat, 10 Oct 2020 at 12:01, 刘 芃成 <[hidden email]> wrote: >>> >>> > Hi, Benchao, >>> > I think I got your point, actually, in current implementation >>> for >>> > group window aggregation, the value of time attributes(e.g. >>> > TUMBLE_ROWTIME/TUMBLE_PROCTIME) is calculated as (window_end – 1), so I >>> > think we can just use it directly if you need this. But I think this >>> time >>> > attributes is mainly suggested to use in case of cascaded window >>> operations. >>> > Regarding the example you provided, I think the semantics of the SQL in >>> > your example which doing interval join(e.g. with TUMBLE_ROWTIME) after >>> > window aggregation is not clear in the current implementation, and I >>> think >>> > that’s a strong reason why we need the new TVFs syntax. >>> > With the new syntax, users should understand which time column to >>> > use and how to generate it when doing interval join and etc. >>> > >>> > Best, >>> > Pengcheng >>> > >>> > 发件人: Benchao Li <[hidden email]> >>> > 日期: 2020年10月10日 星期六 上午11:02 >>> > 收件人: pengcheng Liu <[hidden email]> >>> > 抄送: dev <[hidden email]> >>> > 主题: Re: [DISCUSS] FLIP-145: Support SQL windowing table-valued function >>> > >>> > Hi pengcheng, >>> > >>> > Thanks for your response. >>> > I knew that the original time attribute column will be retained after >>> the >>> > TVF, >>> > what I'm questioning is how do we get the time attribute column after >>> > Aggregation. >>> > Your answer did not remove my doubts about this. >>> > >>> > It's ok if we did not plan to integrate new TVF aggregate with old >>> "time >>> > attribute scenarios" >>> > listed in my previous email in this FLIP. However it's good to >>> elaborate >>> > leave it to the future plan. >>> > >>> > pengcheng Liu <[hidden email]<mailto: >>> > [hidden email]>> 于2020年10月10日周六 上午10:45写道: >>> > Hi,Benchao, >>> > In TVFs, the time attributes is just passed through from parent >>> rels, >>> > and the TVFs just add two >>> > additional window attributes(i.e. window_start & window_end). >>> Also, I >>> > think the time columns can be not only a time attribute >>> > with type of `TimeIndicatorType` but also a regular column with >>> type >>> > of `Timestamp`. >>> > >>> > For cascaded window operations, we can use window_start/window_end >>> of >>> > the previous window result directly to >>> > indicate operating on the same window, or use new DESCRIPTOR >>> column >>> > to assign new windows, in case of the change of >>> > the time column(e.g. in some case, the original timestamp is >>> > inaccurate and need some conversion to be used). >>> > >>> > You can check the definition or signature of these TVFs in the >>> FLIP. >>> > e.g. >>> > SELECT * FROM TABLE( >>> > TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES)) >>> > In the example, the `bidtime` is the time attribute column, which >>> is >>> > the first operand of the DESCRIPTOR function. >>> > >>> > +1 start voting. >>> > >>> > Benchao Li <[hidden email]<mailto:[hidden email]>> >>> > 于2020年10月10日周六 上午10:08写道: >>> > Hi Jark, >>> > >>> > 2 & 3 sounds good to me. >>> > >>> > Regarding time attribute, >>> > I still have some questions, I knew it's easy to support cascaded >>> window >>> > aggregate using new TVFs. >>> > However there are some other places where need time attribute: >>> > - CEP >>> > - interval join >>> > - order by >>> > - over window >>> > If there is no time attribute column, how do we integrate these old >>> > features with the new TVFs. >>> > E.g. >>> > StreamA -> new window aggregate -> interval join -> Sink >>> > / >>> > StreamB ----------------------------------- >>> > >>> > >>> > Jark Wu <[hidden email]<mailto:[hidden email]>> 于2020年10月9日周五 >>> > 下午11:51写道: >>> > Hi Benchao, >>> > >>> > 1) time attribute >>> > Yes. We don't need time attribute auxiliary function. Because the new >>> > window operations are all based on the >>> > window_start and window_end columns instead of on the time >>> attributes. So >>> > we don't need to propagate time attributes. >>> > Cascaded window aggregate can be expressed by simply GROUP BY the >>> > window_start and window_end of the previous window result. >>> > I have added a cascaded window aggregate example in the Tumbling Window >>> > section in the FLIP. >>> > If you want to define proctime window aggregate, the time column in TVF >>> > should be a proctime attribute field (or PROCTIME() function). >>> > >>> > 2) batch support >>> > Yes. The proposed syntax/API are unified for batch and streaming. Batch >>> > support is in the plan, but may not have enough time to catch up 1.12. >>> > >>> > 3) support `grouping sets` >>> > This is not included in the FLIP, but I think it's great if we can >>> support >>> > `grouping sets`. >>> > The existing window impl doesn't support this because we convert the >>> > LogicalAggregate into WindowAggregate in the beginning, >>> > the expand grouping sets rule can't be applied in this situation. >>> > Fortunately, with the new window impl, the conversion to >>> WindowAggregate >>> > will happen at the end, so I think the expand rule can be >>> > applied and support this feature naturally. >>> > Therefore, IMO, we don't need to include this feature in this FLIP to >>> avoid >>> > the FLIP being too large. >>> > This can be a follow-up issue (maybe just add tests and docs) after the >>> > FLIP. >>> > >>> > Best, >>> > Jark >>> > >>> > >>> > On Fri, 9 Oct 2020 at 19:09, 刘 芃成 <[hidden email]<mailto: >>> > [hidden email]>> wrote: >>> > >>> > > Hi,Benchao, >>> > > Welcome to join the discussion, yes, this new syntax can >>> make SQL >>> > > more clear and simpler. >>> > > For your first question, the `window_start` and `window_end` >>> > > columns will be added automatically, >>> > > so we don't need to use auxiliary group functions to infer or >>> > > access the window properties. >>> > > >>> > > For the `grouping sets` on TVFs, I think it's interesting if >>> we >>> > > can support it, as we already supported `grouping sets` >>> > > on streaming aggregates in blink planner. But I'm not sure >>> if it >>> > > will be included into this FLIP. >>> > > >>> > > cc @Jark Wu >>> > > >>> > > Best, >>> > > Pengcheng >>> > > >>> > > >>> > > 在 2020/10/9 下午5:25,“Benchao Li”<[hidden email]<mailto: >>> > [hidden email]>> 写入: >>> > > >>> > > Thanks Jark for bringing this discussion, I like this FLIP very >>> much. >>> > > >>> > > Especially the cumulate window, it's much like the current TUMBLE >>> > > window + >>> > > Fast Emit (which is an undocumented experimental feature), >>> however, >>> > > it's >>> > > more powerful. >>> > > >>> > > And This will make the SQL semantic more standard, especially >>> for the >>> > > HOPPING window. >>> > > >>> > > Regarding time attribute, >>> > > It seems that we don't need a specific function to infer the time >>> > > attribute >>> > > like >>> > > `TUMBLE_ROWTIME` / `TUMBLE_PROCTIME`. Then are `window_start` and >>> > > `window_end` >>> > > column a time attribute column automatically? >>> > > - If not, what will be the time attribute of the result relation >>> of >>> > > these >>> > > TVFs? >>> > > Especially after the window aggregation. >>> > > - If yes, then how do we handle proctime? >>> > > >>> > > Regarding batch operators, >>> > > It's great to hear that we can reuse the batch operators in >>> > continuous >>> > > batch mode >>> > > as you mentioned in the FLIP. >>> > > Current window aggregate could also be used in batch mode with >>> > > rowtime. Do >>> > > you plan >>> > > to support these TVFs for batch mode in this FLIP? Hence the >>> > Table/SQL >>> > > is a >>> > > unified >>> > > API, it's great if we can keep the features complete both in >>> > streaming >>> > > and >>> > > batch mode. >>> > > >>> > > There is one more question, I don't know whether it should be >>> > > considered in >>> > > this FLIP. >>> > > Does the new window support `grouping sets`? (It's not supported >>> in >>> > old >>> > > window impl). >>> > > >>> > > Jark Wu <[hidden email]<mailto:[hidden email]>> >>> 于2020年10月9日周五 >>> > 下午4:14写道: >>> > > >>> > > > Hi all, >>> > > > >>> > > > I know we have a lot of discussion and development on going >>> right >>> > > now but >>> > > > it would be great if we can get FLIP-145 into a votable state. >>> > > > If there are no objections, I would like to start voting in the >>> > next >>> > > days. >>> > > > >>> > > > Best, >>> > > > Jark >>> > > > >>> > > > On Thu, 1 Oct 2020 at 14:29, Jark Wu <[hidden email]<mailto: >>> > [hidden email]>> wrote: >>> > > > >>> > > > > Hi everyone, >>> > > > > >>> > > > > I have added a section for Performance Optimization to >>> describe >>> > > how to >>> > > > > improve the performance in the short-term and long-term >>> > > > > and sketch the future performance potential under the new >>> window >>> > > API. >>> > > > > Introducing the window API is just the first step, we will >>> > > > > continuously improve the performance to make it powerful and >>> > > useful. >>> > > > > >>> > > > > Best, >>> > > > > Jark >>> > > > > >>> > > > > On Thu, 1 Oct 2020 at 14:28, Jark Wu <[hidden email] >>> <mailto: >>> > [hidden email]>> wrote: >>> > > > > >>> > > > >> Hi Pengcheng, >>> > > > >> >>> > > > >> Yes, the window TVF is part of the FLIP. Welcome to >>> contribute >>> > > and join >>> > > > >> the discussion. >>> > > > >> Regarding the SESSION window aggregation, users can use the >>> > > existing >>> > > > >> grouped session window function. >>> > > > >> >>> > > > >> Best, >>> > > > >> Jark >>> > > > >> >>> > > > >> On Sun, 27 Sep 2020 at 21:24, liupengcheng < >>> > > [hidden email]<mailto:[hidden email]> >>> > > > > >>> > > > >> wrote: >>> > > > >> >>> > > > >>> Hi Jark, >>> > > > >>> Thanks for reply, yes, I think it's a good >>> feature, it >>> > > can >>> > > > >>> improve the NRT scenarios >>> > > > >>> as you mentioned in the FLIP. Also, I think it can >>> > > improve the >>> > > > >>> streaming SQL greatly, >>> > > > >>> it can support richer window operations in flink >>> SQL >>> > and >>> > > bring >>> > > > >>> great convenience to users. >>> > > > >>> (we are now only supported group window in flink). >>> > > > >>> >>> > > > >>> Regarding the SESSION window, I think it's >>> especially >>> > > useful >>> > > > for >>> > > > >>> user behavior analysis(e.g. >>> > > > >>> counting user visits on a news website or social >>> > > platform), but >>> > > > >>> I agree that we can keep it >>> > > > >>> out of the FLIP now to catch up 1.12. >>> > > > >>> >>> > > > >>> Recently, I've done some work on the stream planner >>> > with >>> > > the >>> > > > >>> TVFs, and I'm willing to contribute >>> > > > >>> to this part. Is it in the plan of this FLIP? >>> > > > >>> >>> > > > >>> Best, >>> > > > >>> PengchengLiu >>> > > > >>> >>> > > > >>> >>> > > > >>> 在 2020/9/26 下午11:09,“Jark Wu”<[hidden email]<mailto: >>> > [hidden email]>> 写入: >>> > > > >>> >>> > > > >>> Hi pengcheng, >>> > > > >>> >>> > > > >>> That's great to see you also have the need of window >>> join. >>> > > > >>> You are right, the windowing TVF is a powerful feature >>> > which >>> > > can >>> > > > >>> support >>> > > > >>> more operations in the future. >>> > > > >>> I think it as of the date time "partition" selection in >>> > > batch SQL >>> > > > >>> jobs, >>> > > > >>> with this new syntax, I think it is possible >>> > > > >>> to migrate traditional batch SQL jobs to Flink SQL by >>> > > changing a >>> > > > >>> few lines. >>> > > > >>> >>> > > > >>> Regarding the SESSION window, this is on purpose to >>> keep it >>> > > out of >>> > > > >>> the >>> > > > >>> FLIP, because we want to keep the >>> > > > >>> FLIP small to catch up 1.12 and SESSION TVF is rarely >>> > useful >>> > > (e.g. >>> > > > >>> session >>> > > > >>> window join?). >>> > > > >>> >>> > > > >>> Best, >>> > > > >>> Jark >>> > > > >>> >>> > > > >>> On Fri, 25 Sep 2020 at 22:59, liupengcheng < >>> > > > >>> [hidden email]<mailto: >>> [hidden email] >>> > >> >>> > > > >>> wrote: >>> > > > >>> >>> > > > >>> > Hi, Jark, >>> > > > >>> > I'm very interested in this feature, and I'm >>> also >>> > > working >>> > > > >>> on this >>> > > > >>> > recently. >>> > > > >>> > I just have a glance at the FLIP, it's good, >>> but >>> > I >>> > > found >>> > > > >>> that >>> > > > >>> > there is no plan to add SESSION windows. >>> > > > >>> > Also, I think there can be more things we >>> can do >>> > > based on >>> > > > >>> this new >>> > > > >>> > syntax. For example, >>> > > > >>> > - window sort support >>> > > > >>> > - window union/intersect/minus support >>> > > > >>> > - Improve dimension table join >>> > > > >>> > We can have more deep discussion on this new >>> > > feature >>> > > > later >>> > > > >>> . >>> > > > >>> > I've also opened an jira that is related to >>> this >>> > > feature >>> > > > >>> recently: >>> > > > >>> > https://issues.apache.org/jira/browse/FLINK-18830 >>> > > > >>> > >>> > > > >>> > Best! >>> > > > >>> > PengchengLiu >>> > > > >>> > >>> > > > >>> > 在 2020/9/25 下午10:30,“Jark Wu”<[hidden email] >>> <mailto: >>> > [hidden email]>> 写入: >>> > > > >>> > >>> > > > >>> > Hi everyone, >>> > > > >>> > >>> > > > >>> > I want to start a FLIP about supporting windowing >>> > > > table-valued >>> > > > >>> > functions >>> > > > >>> > (TVF). >>> > > > >>> > The main purpose of this FLIP is to improve the >>> near >>> > > > real-time >>> > > > >>> (NRT) >>> > > > >>> > experience of Flink. >>> > > > >>> > >>> > > > >>> > FLIP-145: >>> > > > >>> > >>> > > > >>> > >>> > > > >>> >>> > > > >>> > > >>> > >>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-145%3A+Support+SQL+windowing+table-valued+function >>> > > > >>> > >>> > > > >>> > We want to introduce TUMBLE, HOP, CUMULATE >>> windowing >>> > > TVFs, >>> > > > the >>> > > > >>> > CUMULATE is >>> > > > >>> > a new kind of window. >>> > > > >>> > With the windowing TVFs, we can support richer >>> > > operations on >>> > > > >>> windows, >>> > > > >>> > including window join, window TopN and so on. >>> > > > >>> > This makes things simple: we only need to assign >>> > > windows at >>> > > > the >>> > > > >>> > beginning >>> > > > >>> > of the query, and then apply operations after >>> that >>> > like >>> > > > >>> traditional >>> > > > >>> > batch >>> > > > >>> > SQL. >>> > > > >>> > We hope it can help to reduce the learning curve >>> of >>> > > windows, >>> > > > >>> improve >>> > > > >>> > NRT >>> > > > >>> > for Flink, and attract more batch users. >>> > > > >>> > >>> > > > >>> > A simple code snippet for 10 minutes tumbling >>> window >>> > > > aggregate: >>> > > > >>> > >>> > > > >>> > SELECT window_start, window_end, SUM(price) >>> > > > >>> > FROM TABLE( >>> > > > >>> > TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), >>> INTERVAL >>> > > '10' >>> > > > >>> MINUTES)) >>> > > > >>> > GROUP BY window_start, window_end; >>> > > > >>> > >>> > > > >>> > I'm looking forward to your feedback. >>> > > > >>> > >>> > > > >>> > Best, >>> > > > >>> > Jark >>> > > > >>> > >>> > > > >>> > >>> > > > >>> > >>> > > > >>> >>> > > > >>> >>> > > > >>> >>> > > > >>> > > >>> > > >>> > > -- >>> > > >>> > > Best, >>> > > Benchao Li >>> > > >>> > >>> > >>> > -- >>> > >>> > Best, >>> > Benchao Li >>> > >>> > >>> > -- >>> > >>> > Best, >>> > Benchao Li >>> > >>> >> |
Free forum by Nabble | Edit this page |