Hi Rong Rong,
Sorry for the late reply, and thanks for your feedback! We will continue to add more convenience features to the Table API, such as map, flatMap, agg, flatAgg, iteration, etc. I am very happy that you are interested in this proposal. Since this is long-term, continuous work, we will push it forward in stages. Xiaowei has started a thread outlining what we are proposing; please see the details in the mail thread: https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB . The Table API Enhancement Outline is here: https://docs.google.com/document/d/1tnpxg31EQz2-MEzSotwFzqatsB4rNLz0I-l_vPa5H4Q/edit?usp=sharing Please let us know if you have further thoughts or feedback! Thanks, Jincheng Fabian Hueske <[hidden email]> wrote on Mon, Nov 5, 2018 at 7:03 PM: > Hi Jincheng, > > Thanks for this interesting proposal. > I like that we can push this effort forward in a very fine-grained manner, > i.e., incrementally adding more APIs to the Table API. > > However, I also have a few questions / concerns. > Today, the Table API is tightly integrated with the DataSet and DataStream > APIs. It is very easy to convert a Table into a DataSet or DataStream and > vice versa. This mean it is already easy to combine custom logic an > relational operations. What I like is that several aspects are clearly > separated like retraction and timestamp handling (see below) + all > libraries on DataStream/DataSet can be easily combined with relational > operations. > I can see that adding more functionality to the Table API would remove the > distinction between DataSet and DataStream. However, wouldn't we get a > similar benefit by extending the DataStream API for proper support for > bounded streams (as is the long-term goal of Flink)? > I'm also a bit skeptical about the optimization opportunities we would > gain. 
Map/FlatMap UDFs are black boxes that cannot be easily removed > without additional information (I did some research on this a few years ago > [1]). > > Moreover, I think there are a few tricky details that need to be resolved > to enable a good integration. > > 1) How to deal with retraction messages? The DataStream API does not have a > notion of retractions. How would a MapFunction or FlatMapFunction handle > retraction? Do they need to be aware of the change flag? Custom windowing > and aggregation logic would certainly need to have that information. > 2) How to deal with timestamps? The DataStream API does not give access to > timestamps. In the Table API / SQL these are exposed as regular attributes. > How can we ensure that timestamp attributes remain valid (i.e. aligned with > watermarks) if the output is produced by arbitrary code? > There might be more issues of this kind. > > My main question would be how much would we gain with this proposal over a > tight integration of Table API and DataStream API, assuming that batch > functionality is moved to DataStream? > > Best, Fabian > > [1] http://stratosphere.eu/assets/papers/openingTheBlackBoxes_12.pdf > > > Am Mo., 5. Nov. 2018 um 02:49 Uhr schrieb Rong Rong <[hidden email]>: > > > Hi Jincheng, > > > > Thank you for the proposal! I think being able to define a process / > > co-process function in table API definitely opens up a whole new level of > > applications using a unified API. > > > > In addition, as Tzu-Li and Hequn have mentioned, the benefit of > > optimization layer of Table API will already bring in additional benefit > > over directly programming on top of DataStream/DataSet API. I am very > > interested an looking forward to seeing the support for more complex use > > cases, especially iterations. It will enable table API to define much > > broader, event-driven use cases such as real-time ML prediction/training. > > > > As Timo mentioned, This will make Table API diverge from the SQL API. 
But > > as from my experience Table API was always giving me the impression to > be a > > more sophisticated, syntactic-aware way to express relational operations. > > Looking forward to further discussion and collaborations on the FLIP doc. > > > > -- > > Rong > > > > On Sun, Nov 4, 2018 at 5:22 PM jincheng sun <[hidden email]> > > wrote: > > > > > Hi tison, > > > > > > Thanks a lot for your feedback! > > > I am very happy to see that community contributors agree to enhanced > the > > > TableAPI. This work is a long-term continuous work, we will push it in > > > stages, we will soon complete the enhanced list of the first phase, we > > can > > > go deep discussion in google doc. thanks again for joining on the very > > > important discussion of the Flink Table API. > > > > > > Thanks, > > > Jincheng > > > > > > Tzu-Li Chen <[hidden email]> 于2018年11月2日周五 下午1:49写道: > > > > > > > Hi jingchengm > > > > > > > > Thanks a lot for your proposal! I find it is a good start point for > > > > internal optimization works and help Flink to be more > > > > user-friendly. > > > > > > > > AFAIK, DataStream is the most popular API currently that Flink > > > > users should describe their logic with detailed logic. > > > > From a more internal view the conversion from DataStream to > > > > JobGraph is quite mechanically and hard to be optimized. So when > > > > users program with DataStream, they have to learn more internals > > > > and spend a lot of time to tune for performance. > > > > With your proposal, we provide enhanced functionality of Table API, > > > > so that users can describe their job easily on Table aspect. This > gives > > > > an opportunity to Flink developers to introduce an optimize phase > > > > while transforming user program(described by Table API) to internal > > > > representation. 
> > > > > > > > Given a user who want to start using Flink with simple ETL, > pipelining > > > > or analytics, he would find it is most naturally described by > SQL/Table > > > > API. Further, as mentioned by @hequn, > > > > > > > > SQL is a widely used language. It follows standards, is a > > > > > descriptive language, and is easy to use > > > > > > > > > > > > thus we could expect with the enhancement of SQL/Table API, Flink > > > > becomes more friendly to users. > > > > > > > > Looking forward to the design doc/FLIP! > > > > > > > > Best, > > > > tison. > > > > > > > > > > > > jincheng sun <[hidden email]> 于2018年11月2日周五 上午11:46写道: > > > > > > > > > Hi Hequn, > > > > > Thanks for your feedback! And also thanks for our offline > discussion! > > > > > You are right, unification of batch and streaming is very important > > for > > > > > flink API. > > > > > We will provide more detailed design later, Please let me know if > you > > > > have > > > > > further thoughts or feedback. > > > > > > > > > > Thanks, > > > > > Jincheng > > > > > > > > > > Hequn Cheng <[hidden email]> 于2018年11月2日周五 上午10:02写道: > > > > > > > > > > > Hi Jincheng, > > > > > > > > > > > > Thanks a lot for your proposal. It is very encouraging! > > > > > > > > > > > > As we all know, SQL is a widely used language. It follows > > standards, > > > > is a > > > > > > descriptive language, and is easy to use. A powerful feature of > SQL > > > is > > > > > that > > > > > > it supports optimization. Users only need to care about the logic > > of > > > > the > > > > > > program. The underlying optimizer will help users optimize the > > > > > performance > > > > > > of the program. However, in terms of functionality and ease of > use, > > > in > > > > > some > > > > > > scenarios sql will be limited, as described in Jincheng's > proposal. > > > > > > > > > > > > Correspondingly, the DataStream/DataSet api can provide powerful > > > > > > functionalities. 
Users can write > ProcessFunction/CoProcessFunction > > > and > > > > > get > > > > > > the timer. Compared with SQL, it provides more functionalities > and > > > > > > flexibilities. However, it does not support optimization like > SQL. > > > > > > Meanwhile, DataStream/DataSet api has not been unified which > means, > > > for > > > > > the > > > > > > same logic, users need to write a job for each stream and batch. > > > > > > > > > > > > With TableApi, I think we can combine the advantages of both. > Users > > > can > > > > > > easily write relational operations and enjoy optimization. At the > > > same > > > > > > time, it supports more functionality and ease of use. Looking > > forward > > > > to > > > > > > the detailed design/FLIP. > > > > > > > > > > > > Best, > > > > > > Hequn > > > > > > > > > > > > On Fri, Nov 2, 2018 at 9:48 AM Shaoxuan Wang < > [hidden email]> > > > > > wrote: > > > > > > > > > > > > > Hi Aljoscha, > > > > > > > Glad that you like the proposal. We have completed the > prototype > > of > > > > > most > > > > > > > new proposed functionalities. Once collect the feedback from > > > > community, > > > > > > we > > > > > > > will come up with a concrete FLIP/design doc. > > > > > > > > > > > > > > Regards, > > > > > > > Shaoxuan > > > > > > > > > > > > > > > > > > > > > On Thu, Nov 1, 2018 at 8:12 PM Aljoscha Krettek < > > > [hidden email] > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > Hi Jincheng, > > > > > > > > > > > > > > > > these points sound very good! Are there any concrete > proposals > > > for > > > > > > > > changes? For example a FLIP/design document? > > > > > > > > > > > > > > > > See here for FLIPs: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals > > > > > > > > > > > > > > > > Best, > > > > > > > > Aljoscha > > > > > > > > > > > > > > > > > On 1. 
Nov 2018, at 12:51, jincheng sun < > > > [hidden email] > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > *--------I am sorry for the formatting of the email > content. > > I > > > > > > reformat > > > > > > > > > the **content** as follows-----------* > > > > > > > > > > > > > > > > > > *Hi ALL,* > > > > > > > > > > > > > > > > > > With the continuous efforts from the community, the Flink > > > system > > > > > has > > > > > > > been > > > > > > > > > continuously improved, which has attracted more and more > > users. > > > > > Flink > > > > > > > SQL > > > > > > > > > is a canonical, widely used relational query language. > > However, > > > > > there > > > > > > > are > > > > > > > > > still some scenarios where Flink SQL failed to meet user > > needs > > > in > > > > > > terms > > > > > > > > of > > > > > > > > > functionality and ease of use, such as: > > > > > > > > > > > > > > > > > > *1. In terms of functionality* > > > > > > > > > Iteration, user-defined window, user-defined join, > > > > user-defined > > > > > > > > > GroupReduce, etc. Users cannot express them with SQL; > > > > > > > > > > > > > > > > > > *2. In terms of ease of use* > > > > > > > > > > > > > > > > > > - Map - e.g. “dataStream.map(mapFun)”. Although > > > > > > “table.select(udf1(), > > > > > > > > > udf2(), udf3()....)” can be used to accomplish the same > > > > > function., > > > > > > > > with a > > > > > > > > > map() function returning 100 columns, one has to define > or > > > call > > > > > 100 > > > > > > > > UDFs > > > > > > > > > when using SQL, which is quite involved. > > > > > > > > > - FlatMap - e.g. “dataStrem.flatmap(flatMapFun)”. > > Similarly, > > > > it > > > > > > can > > > > > > > be > > > > > > > > > implemented with “table.join(udtf).select()”. However, it > > is > > > > > > obvious > > > > > > > > that > > > > > > > > > dataStream is easier to use than SQL. 
> > > > > > > > > > > > > > > > > > Due to the above two reasons, some users have to use the > > > > DataStream > > > > > > API > > > > > > > > or > > > > > > > > > the DataSet API. But when they do that, they lose the > > > unification > > > > > of > > > > > > > > batch > > > > > > > > > and streaming. They will also lose the sophisticated > > > > optimizations > > > > > > such > > > > > > > > as > > > > > > > > > codegen, aggregate join transpose and multi-stage agg from > > > Flink > > > > > SQL. > > > > > > > > > > > > > > > > > > We believe that enhancing the functionality and > productivity > > is > > > > > vital > > > > > > > for > > > > > > > > > the successful adoption of Table API. To this end, Table > API > > > > still > > > > > > > > > requires more efforts from every contributor in the > > community. > > > We > > > > > see > > > > > > > > great > > > > > > > > > opportunity in improving our user’s experience from this > > work. > > > > Any > > > > > > > > feedback > > > > > > > > > is welcome. > > > > > > > > > > > > > > > > > > Regards, > > > > > > > > > > > > > > > > > > Jincheng > > > > > > > > > > > > > > > > > > jincheng sun <[hidden email]> 于2018年11月1日周四 > > > 下午5:07写道: > > > > > > > > > > > > > > > > > >> Hi all, > > > > > > > > >> > > > > > > > > >> With the continuous efforts from the community, the Flink > > > system > > > > > has > > > > > > > > been > > > > > > > > >> continuously improved, which has attracted more and more > > > users. > > > > > > Flink > > > > > > > > SQL > > > > > > > > >> is a canonical, widely used relational query language. 
> > > However, > > > > > > there > > > > > > > > are > > > > > > > > >> still some scenarios where Flink SQL failed to meet user > > needs > > > > in > > > > > > > terms > > > > > > > > of > > > > > > > > >> functionality and ease of use, such as: > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> - > > > > > > > > >> > > > > > > > > >> In terms of functionality > > > > > > > > >> > > > > > > > > >> Iteration, user-defined window, user-defined join, > > > user-defined > > > > > > > > >> GroupReduce, etc. Users cannot express them with SQL; > > > > > > > > >> > > > > > > > > >> - > > > > > > > > >> > > > > > > > > >> In terms of ease of use > > > > > > > > >> - > > > > > > > > >> > > > > > > > > >> Map - e.g. “dataStream.map(mapFun)”. Although > > > > > > > “table.select(udf1(), > > > > > > > > >> udf2(), udf3()....)” can be used to accomplish the > same > > > > > > > function., > > > > > > > > with a > > > > > > > > >> map() function returning 100 columns, one has to > define > > > or > > > > > call > > > > > > > > 100 UDFs > > > > > > > > >> when using SQL, which is quite involved. > > > > > > > > >> - > > > > > > > > >> > > > > > > > > >> FlatMap - e.g. “dataStrem.flatmap(flatMapFun)”. > > > Similarly, > > > > > it > > > > > > > can > > > > > > > > >> be implemented with “table.join(udtf).select()”. > > However, > > > > it > > > > > is > > > > > > > > obvious > > > > > > > > >> that datastream is easier to use than SQL. > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> Due to the above two reasons, some users have to use the > > > > > DataStream > > > > > > > API > > > > > > > > or > > > > > > > > >> the DataSet API. But when they do that, they lose the > > > > unification > > > > > of > > > > > > > > batch > > > > > > > > >> and streaming. They will also lose the sophisticated > > > > optimizations > > > > > > > such > > > > > > > > as > > > > > > > > >> codegen, aggregate join transpose and multi-stage agg > from > > > > Flink > > > > > > SQL. 
> > > > > > > > >> > > > > > > > > >> We believe that enhancing the functionality and > productivity > > > is > > > > > > vital > > > > > > > > for > > > > > > > > >> the successful adoption of Table API. To this end, Table > > API > > > > > still > > > > > > > > >> requires more efforts from every contributor in the > > community. > > > > We > > > > > > see > > > > > > > > great > > > > > > > > >> opportunity in improving our user’s experience from this > > work. > > > > Any > > > > > > > > feedback > > > > > > > > >> is welcome. > > > > > > > > >> > > > > > > > > >> Regards, > > > > > > > > >> > > > > > > > > >> Jincheng > > > > > > > > >>
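To make the map/flatMap enhancement discussed in this thread concrete, here is a toy sketch in plain Python. This is not Flink code: the `Table` class, its `rows` representation, and the method names are illustrative assumptions that only model the proposed row-at-a-time semantics.

```python
# Toy model of the proposed Table.map / Table.flatMap semantics.
# NOT Flink code: this in-memory Table stands in for a Flink Table
# purely to illustrate one-to-one vs. one-to-many row transforms.

class Table:
    def __init__(self, rows):
        self.rows = list(rows)  # each row: dict of column name -> value

    def map(self, fn):
        # map: each input row produces exactly one output row
        return Table(fn(r) for r in self.rows)

    def flat_map(self, fn):
        # flatMap: each input row produces zero or more output rows
        return Table(out for r in self.rows for out in fn(r))


t = Table([{"a": 1, "b": 2}, {"a": 3, "b": 4}])

# One UDF returns the whole output row, instead of one UDF per column.
doubled = t.map(lambda r: {k: v * 2 for k, v in r.items()})

# flatMap can emit several rows per input, like a UDTF join in SQL.
exploded = t.flat_map(lambda r: [{"v": r["a"]}, {"v": r["b"]}])
```

The same shapes would also cover the aggregate cases (agg/flatAgg) mentioned in the thread, with a function over a group of rows instead of a single row.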
Hi Jincheng,
Thanks for your proposal. I think it is a helpful enhancement and a solid step forward for the Table API. It doesn't weaken SQL or the DataStream API, because the conversion between DataStream and Table still works. Users with advanced cases (e.g. complex and fine-grained state control) can go with DataStream, but most general cases can stay in the Table API. This work aims to extend the functionality of the Table API, to broaden its usage scenarios, and to help it become a more widely used API. For example, suppose someone wants to drop one column from a 100-column Table. Currently, we have to convert the Table to a DataStream and use a MapFunction to do that, or select the remaining 99 columns with the Table.select API. If we supported a Table.drop() method, it would be very convenient and would let users stay in the Table API. Looking forward to the more detailed design and further discussion. Regards, Jark jincheng sun <[hidden email]> wrote on Tue, Nov 6, 2018 at 1:05 PM: > Hi Rong Rong, > > Sorry for the late reply, And thanks for your feedback! We will continue > to add more convenience features to the TableAPI, such as map, flatmap, > agg, flatagg, iteration etc. And I am very happy that you are interested on > this proposal. Due to this is a long-term continuous work, we will push it > in stages. Currently Xiaowei has started a threading outlining which talk > about what we are proposing. Please see the detail in the mail thread: > > https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB > < > https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB > > > . > > The Table API Enhancement Outline as follows: > > https://docs.google.com/document/d/1tnpxg31EQz2-MEzSotwFzqatsB4rNLz0I-l_vPa5H4Q/edit?usp=sharing > > Please let we know if you have further thoughts or feedback! > > Thanks, > Jincheng > > Fabian Hueske <[hidden email]> 于2018年11月5日周一 下午7:03写道: > > > Hi Jincheng, > > > > Thanks for this interesting proposal. 
> > I like that we can push this effort forward in a very fine-grained > manner, > > i.e., incrementally adding more APIs to the Table API. > > > > However, I also have a few questions / concerns. > > Today, the Table API is tightly integrated with the DataSet and > DataStream > > APIs. It is very easy to convert a Table into a DataSet or DataStream and > > vice versa. This mean it is already easy to combine custom logic an > > relational operations. What I like is that several aspects are clearly > > separated like retraction and timestamp handling (see below) + all > > libraries on DataStream/DataSet can be easily combined with relational > > operations. > > I can see that adding more functionality to the Table API would remove > the > > distinction between DataSet and DataStream. However, wouldn't we get a > > similar benefit by extending the DataStream API for proper support for > > bounded streams (as is the long-term goal of Flink)? > > I'm also a bit skeptical about the optimization opportunities we would > > gain. Map/FlatMap UDFs are black boxes that cannot be easily removed > > without additional information (I did some research on this a few years > ago > > [1]). > > > > Moreover, I think there are a few tricky details that need to be resolved > > to enable a good integration. > > > > 1) How to deal with retraction messages? The DataStream API does not > have a > > notion of retractions. How would a MapFunction or FlatMapFunction handle > > retraction? Do they need to be aware of the change flag? Custom windowing > > and aggregation logic would certainly need to have that information. > > 2) How to deal with timestamps? The DataStream API does not give access > to > > timestamps. In the Table API / SQL these are exposed as regular > attributes. > > How can we ensure that timestamp attributes remain valid (i.e. aligned > with > > watermarks) if the output is produced by arbitrary code? > > There might be more issues of this kind. 
> > > > My main question would be how much would we gain with this proposal over > a > > tight integration of Table API and DataStream API, assuming that batch > > functionality is moved to DataStream? > > > > Best, Fabian > > > > [1] http://stratosphere.eu/assets/papers/openingTheBlackBoxes_12.pdf > > > > > > Am Mo., 5. Nov. 2018 um 02:49 Uhr schrieb Rong Rong <[hidden email] > >: > > > > > Hi Jincheng, > > > > > > Thank you for the proposal! I think being able to define a process / > > > co-process function in table API definitely opens up a whole new level > of > > > applications using a unified API. > > > > > > In addition, as Tzu-Li and Hequn have mentioned, the benefit of > > > optimization layer of Table API will already bring in additional > benefit > > > over directly programming on top of DataStream/DataSet API. I am very > > > interested an looking forward to seeing the support for more complex > use > > > cases, especially iterations. It will enable table API to define much > > > broader, event-driven use cases such as real-time ML > prediction/training. > > > > > > As Timo mentioned, This will make Table API diverge from the SQL API. > But > > > as from my experience Table API was always giving me the impression to > > be a > > > more sophisticated, syntactic-aware way to express relational > operations. > > > Looking forward to further discussion and collaborations on the FLIP > doc. > > > > > > -- > > > Rong > > > > > > On Sun, Nov 4, 2018 at 5:22 PM jincheng sun <[hidden email]> > > > wrote: > > > > > > > Hi tison, > > > > > > > > Thanks a lot for your feedback! > > > > I am very happy to see that community contributors agree to enhanced > > the > > > > TableAPI. This work is a long-term continuous work, we will push it > in > > > > stages, we will soon complete the enhanced list of the first phase, > we > > > can > > > > go deep discussion in google doc. thanks again for joining on the > very > > > > important discussion of the Flink Table API. 
> > > > > > > > Thanks, > > > > Jincheng > > > > > > > > Tzu-Li Chen <[hidden email]> 于2018年11月2日周五 下午1:49写道: > > > > > > > > > Hi jingchengm > > > > > > > > > > Thanks a lot for your proposal! I find it is a good start point for > > > > > internal optimization works and help Flink to be more > > > > > user-friendly. > > > > > > > > > > AFAIK, DataStream is the most popular API currently that Flink > > > > > users should describe their logic with detailed logic. > > > > > From a more internal view the conversion from DataStream to > > > > > JobGraph is quite mechanically and hard to be optimized. So when > > > > > users program with DataStream, they have to learn more internals > > > > > and spend a lot of time to tune for performance. > > > > > With your proposal, we provide enhanced functionality of Table API, > > > > > so that users can describe their job easily on Table aspect. This > > gives > > > > > an opportunity to Flink developers to introduce an optimize phase > > > > > while transforming user program(described by Table API) to internal > > > > > representation. > > > > > > > > > > Given a user who want to start using Flink with simple ETL, > > pipelining > > > > > or analytics, he would find it is most naturally described by > > SQL/Table > > > > > API. Further, as mentioned by @hequn, > > > > > > > > > > SQL is a widely used language. It follows standards, is a > > > > > > descriptive language, and is easy to use > > > > > > > > > > > > > > > thus we could expect with the enhancement of SQL/Table API, Flink > > > > > becomes more friendly to users. > > > > > > > > > > Looking forward to the design doc/FLIP! > > > > > > > > > > Best, > > > > > tison. > > > > > > > > > > > > > > > jincheng sun <[hidden email]> 于2018年11月2日周五 上午11:46写道: > > > > > > > > > > > Hi Hequn, > > > > > > Thanks for your feedback! And also thanks for our offline > > discussion! 
> > > > > > You are right, unification of batch and streaming is very > important > > > for > > > > > > flink API. > > > > > > We will provide more detailed design later, Please let me know if > > you > > > > > have > > > > > > further thoughts or feedback. > > > > > > > > > > > > Thanks, > > > > > > Jincheng > > > > > > > > > > > > Hequn Cheng <[hidden email]> 于2018年11月2日周五 上午10:02写道: > > > > > > > > > > > > > Hi Jincheng, > > > > > > > > > > > > > > Thanks a lot for your proposal. It is very encouraging! > > > > > > > > > > > > > > As we all know, SQL is a widely used language. It follows > > > standards, > > > > > is a > > > > > > > descriptive language, and is easy to use. A powerful feature of > > SQL > > > > is > > > > > > that > > > > > > > it supports optimization. Users only need to care about the > logic > > > of > > > > > the > > > > > > > program. The underlying optimizer will help users optimize the > > > > > > performance > > > > > > > of the program. However, in terms of functionality and ease of > > use, > > > > in > > > > > > some > > > > > > > scenarios sql will be limited, as described in Jincheng's > > proposal. > > > > > > > > > > > > > > Correspondingly, the DataStream/DataSet api can provide > powerful > > > > > > > functionalities. Users can write > > ProcessFunction/CoProcessFunction > > > > and > > > > > > get > > > > > > > the timer. Compared with SQL, it provides more functionalities > > and > > > > > > > flexibilities. However, it does not support optimization like > > SQL. > > > > > > > Meanwhile, DataStream/DataSet api has not been unified which > > means, > > > > for > > > > > > the > > > > > > > same logic, users need to write a job for each stream and > batch. > > > > > > > > > > > > > > With TableApi, I think we can combine the advantages of both. > > Users > > > > can > > > > > > > easily write relational operations and enjoy optimization. At > the > > > > same > > > > > > > time, it supports more functionality and ease of use. 
Looking > > > forward > > > > > to > > > > > > > the detailed design/FLIP. > > > > > > > > > > > > > > Best, > > > > > > > Hequn > > > > > > > > > > > > > > On Fri, Nov 2, 2018 at 9:48 AM Shaoxuan Wang < > > [hidden email]> > > > > > > wrote: > > > > > > > > > > > > > > > Hi Aljoscha, > > > > > > > > Glad that you like the proposal. We have completed the > > prototype > > > of > > > > > > most > > > > > > > > new proposed functionalities. Once collect the feedback from > > > > > community, > > > > > > > we > > > > > > > > will come up with a concrete FLIP/design doc. > > > > > > > > > > > > > > > > Regards, > > > > > > > > Shaoxuan > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Nov 1, 2018 at 8:12 PM Aljoscha Krettek < > > > > [hidden email] > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > Hi Jincheng, > > > > > > > > > > > > > > > > > > these points sound very good! Are there any concrete > > proposals > > > > for > > > > > > > > > changes? For example a FLIP/design document? > > > > > > > > > > > > > > > > > > See here for FLIPs: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > Aljoscha > > > > > > > > > > > > > > > > > > > On 1. Nov 2018, at 12:51, jincheng sun < > > > > [hidden email] > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > *--------I am sorry for the formatting of the email > > content. > > > I > > > > > > > reformat > > > > > > > > > > the **content** as follows-----------* > > > > > > > > > > > > > > > > > > > > *Hi ALL,* > > > > > > > > > > > > > > > > > > > > With the continuous efforts from the community, the Flink > > > > system > > > > > > has > > > > > > > > been > > > > > > > > > > continuously improved, which has attracted more and more > > > users. 
> > > > > > Flink > > > > > > > > SQL > > > > > > > > > > is a canonical, widely used relational query language. > > > However, > > > > > > there > > > > > > > > are > > > > > > > > > > still some scenarios where Flink SQL failed to meet user > > > needs > > > > in > > > > > > > terms > > > > > > > > > of > > > > > > > > > > functionality and ease of use, such as: > > > > > > > > > > > > > > > > > > > > *1. In terms of functionality* > > > > > > > > > > Iteration, user-defined window, user-defined join, > > > > > user-defined > > > > > > > > > > GroupReduce, etc. Users cannot express them with SQL; > > > > > > > > > > > > > > > > > > > > *2. In terms of ease of use* > > > > > > > > > > > > > > > > > > > > - Map - e.g. “dataStream.map(mapFun)”. Although > > > > > > > “table.select(udf1(), > > > > > > > > > > udf2(), udf3()....)” can be used to accomplish the same > > > > > > function., > > > > > > > > > with a > > > > > > > > > > map() function returning 100 columns, one has to define > > or > > > > call > > > > > > 100 > > > > > > > > > UDFs > > > > > > > > > > when using SQL, which is quite involved. > > > > > > > > > > - FlatMap - e.g. “dataStrem.flatmap(flatMapFun)”. > > > Similarly, > > > > > it > > > > > > > can > > > > > > > > be > > > > > > > > > > implemented with “table.join(udtf).select()”. However, > it > > > is > > > > > > > obvious > > > > > > > > > that > > > > > > > > > > dataStream is easier to use than SQL. > > > > > > > > > > > > > > > > > > > > Due to the above two reasons, some users have to use the > > > > > DataStream > > > > > > > API > > > > > > > > > or > > > > > > > > > > the DataSet API. But when they do that, they lose the > > > > unification > > > > > > of > > > > > > > > > batch > > > > > > > > > > and streaming. They will also lose the sophisticated > > > > > optimizations > > > > > > > such > > > > > > > > > as > > > > > > > > > > codegen, aggregate join transpose and multi-stage agg > from > > > > Flink > > > > > > SQL. 
> > > > > > > > > > > > > > > > > > > > We believe that enhancing the functionality and > > productivity > > > is > > > > > > vital > > > > > > > > for > > > > > > > > > > the successful adoption of Table API. To this end, Table > > API > > > > > still > > > > > > > > > > requires more efforts from every contributor in the > > > community. > > > > We > > > > > > see > > > > > > > > > great > > > > > > > > > > opportunity in improving our user’s experience from this > > > work. > > > > > Any > > > > > > > > > feedback > > > > > > > > > > is welcome. > > > > > > > > > > > > > > > > > > > > Regards, > > > > > > > > > > > > > > > > > > > > Jincheng > > > > > > > > > > > > > > > > > > > > jincheng sun <[hidden email]> 于2018年11月1日周四 > > > > 下午5:07写道: > > > > > > > > > > > > > > > > > > > >> Hi all, > > > > > > > > > >> > > > > > > > > > >> With the continuous efforts from the community, the > Flink > > > > system > > > > > > has > > > > > > > > > been > > > > > > > > > >> continuously improved, which has attracted more and more > > > > users. > > > > > > > Flink > > > > > > > > > SQL > > > > > > > > > >> is a canonical, widely used relational query language. > > > > However, > > > > > > > there > > > > > > > > > are > > > > > > > > > >> still some scenarios where Flink SQL failed to meet user > > > needs > > > > > in > > > > > > > > terms > > > > > > > > > of > > > > > > > > > >> functionality and ease of use, such as: > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> - > > > > > > > > > >> > > > > > > > > > >> In terms of functionality > > > > > > > > > >> > > > > > > > > > >> Iteration, user-defined window, user-defined join, > > > > user-defined > > > > > > > > > >> GroupReduce, etc. Users cannot express them with SQL; > > > > > > > > > >> > > > > > > > > > >> - > > > > > > > > > >> > > > > > > > > > >> In terms of ease of use > > > > > > > > > >> - > > > > > > > > > >> > > > > > > > > > >> Map - e.g. “dataStream.map(mapFun)”. 
Although > > > > > > > > “table.select(udf1(), > > > > > > > > > >> udf2(), udf3()....)” can be used to accomplish the > > same > > > > > > > > function., > > > > > > > > > with a > > > > > > > > > >> map() function returning 100 columns, one has to > > define > > > > or > > > > > > call > > > > > > > > > 100 UDFs > > > > > > > > > >> when using SQL, which is quite involved. > > > > > > > > > >> - > > > > > > > > > >> > > > > > > > > > >> FlatMap - e.g. “dataStrem.flatmap(flatMapFun)”. > > > > Similarly, > > > > > > it > > > > > > > > can > > > > > > > > > >> be implemented with “table.join(udtf).select()”. > > > However, > > > > > it > > > > > > is > > > > > > > > > obvious > > > > > > > > > >> that datastream is easier to use than SQL. > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> Due to the above two reasons, some users have to use the > > > > > > DataStream > > > > > > > > API > > > > > > > > > or > > > > > > > > > >> the DataSet API. But when they do that, they lose the > > > > > unification > > > > > > of > > > > > > > > > batch > > > > > > > > > >> and streaming. They will also lose the sophisticated > > > > > optimizations > > > > > > > > such > > > > > > > > > as > > > > > > > > > >> codegen, aggregate join transpose and multi-stage agg > > from > > > > > Flink > > > > > > > SQL. > > > > > > > > > >> > > > > > > > > > >> We believe that enhancing the functionality and > > productivity > > > > is > > > > > > > vital > > > > > > > > > for > > > > > > > > > >> the successful adoption of Table API. To this end, > Table > > > API > > > > > > still > > > > > > > > > >> requires more efforts from every contributor in the > > > community. > > > > > We > > > > > > > see > > > > > > > > > great > > > > > > > > > >> opportunity in improving our user’s experience from this > > > work. > > > > > Any > > > > > > > > > feedback > > > > > > > > > >> is welcome. 
> > > > > > > > > >> > > > > > > > > > >> Regards, > > > > > > > > > >> > > > > > > > > > >> Jincheng > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > |
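The ease-of-use point quoted above (flatMap expressed as "table.join(udtf).select()") can be illustrated with a small, self-contained sketch. This is a toy in plain Python, not the actual Flink API: it only shows why a per-row table function applied as a lateral join produces the same rows as a direct flatMap.

```python
# Toy illustration (NOT the Flink API): a flatMap over rows is equivalent
# to applying a table function (UDTF) per input row, i.e. a lateral join.

def split_words(row):
    """A 'UDTF': one input row -> zero or more output rows."""
    for word in row["sentence"].split():
        yield {"id": row["id"], "word": word}

def flat_map(rows, fn):
    """DataStream-style: dataStream.flatMap(fn)."""
    return [out for row in rows for out in fn(row)]

def join_udtf_select(rows, udtf):
    """Table-style: table.join(udtf).select(...) -- same result."""
    result = []
    for row in rows:              # lateral join: apply the UDTF to each row
        result.extend(udtf(row))  # and emit its output rows
    return result

rows = [{"id": 1, "sentence": "hello world"}, {"id": 2, "sentence": "flink"}]
assert flat_map(rows, split_words) == join_udtf_select(rows, split_words)
```

Both paths yield the same three output rows here; the proposal's point is only that the flatMap spelling is the more natural one for users.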
Hi Jark,
Glad to see your feedback! That's correct: the proposal aims to extend the functionality of the Table API! I like adding "drop" to fit the use case you mentioned. Beyond that, consider a 100-column Table whose UDF needs all 100 columns: we don't want to define the eval method as eval(column0...column99); we prefer to define it as eval(Row) and use it like table.select(udf(*)). We also need to consider whether to package the columns as a Row. In a scenario like this, we classify it as a column operation, and we will list the changes to column operations after the map/flatMap/agg/flatAgg phase is completed. Currently, Xiaowei has started a thread outlining what we are proposing. Please see the details in the mail thread: https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB . At this stage the Table API Enhancement Outline is as follows: https://docs.google.com/document/d/1tnpxg31EQz2-MEzSotwFzqatsB4rNLz0I-l_vPa5H4Q/edit?usp=sharing Please let us know if you have further thoughts or feedback! Thanks, Jincheng Jark Wu <[hidden email]> wrote on Tue, Nov 6, 2018, at 3:35 PM: > Hi jingcheng, > > Thanks for your proposal. I think it is a helpful enhancement for TableAPI > which is a solid step forward for TableAPI. > It doesn't weaken SQL or DataStream, because the conversion between > DataStream and Table still works. > People with advanced cases (e.g. complex and fine-grained state control) > can go with DataStream, > but most general cases can stay in TableAPI. This works is aiming to extend > the functionality for TableAPI, > to extend the usage scenario, to help TableAPI becomes a more widely used > API. > > For example, someone want to drop one column from a 100-columns Table. 
> Currently, we have to convert > Table to DataStream and use MapFunction to do that, or select the remaining > 99 columns using Table.select API. > But if we support Table.drop() method for TableAPI, it will be a very > convenient method and let users stay in Table. > > Looking forward to the more detailed design and further discussion. > > Regards, > Jark > > jincheng sun <[hidden email]> 于2018年11月6日周二 下午1:05写道: > > > Hi Rong Rong, > > > > Sorry for the late reply, And thanks for your feedback! We will continue > > to add more convenience features to the TableAPI, such as map, flatmap, > > agg, flatagg, iteration etc. And I am very happy that you are interested > on > > this proposal. Due to this is a long-term continuous work, we will push > it > > in stages. Currently Xiaowei has started a threading outlining which > talk > > about what we are proposing. Please see the detail in the mail thread: > > > > > https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB > > < > > > https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB > > > > > . > > > > The Table API Enhancement Outline as follows: > > > > > https://docs.google.com/document/d/1tnpxg31EQz2-MEzSotwFzqatsB4rNLz0I-l_vPa5H4Q/edit?usp=sharing > > > > Please let we know if you have further thoughts or feedback! > > > > Thanks, > > Jincheng > > > > Fabian Hueske <[hidden email]> 于2018年11月5日周一 下午7:03写道: > > > > > Hi Jincheng, > > > > > > Thanks for this interesting proposal. > > > I like that we can push this effort forward in a very fine-grained > > manner, > > > i.e., incrementally adding more APIs to the Table API. > > > > > > However, I also have a few questions / concerns. > > > Today, the Table API is tightly integrated with the DataSet and > > DataStream > > > APIs. It is very easy to convert a Table into a DataSet or DataStream > and > > > vice versa. This mean it is already easy to combine custom logic an > > > relational operations. 
What I like is that several aspects are clearly > > > separated like retraction and timestamp handling (see below) + all > > > libraries on DataStream/DataSet can be easily combined with relational > > > operations. > > > I can see that adding more functionality to the Table API would remove > > the > > > distinction between DataSet and DataStream. However, wouldn't we get a > > > similar benefit by extending the DataStream API for proper support for > > > bounded streams (as is the long-term goal of Flink)? > > > I'm also a bit skeptical about the optimization opportunities we would > > > gain. Map/FlatMap UDFs are black boxes that cannot be easily removed > > > without additional information (I did some research on this a few years > > ago > > > [1]). > > > > > > Moreover, I think there are a few tricky details that need to be > resolved > > > to enable a good integration. > > > > > > 1) How to deal with retraction messages? The DataStream API does not > > have a > > > notion of retractions. How would a MapFunction or FlatMapFunction > handle > > > retraction? Do they need to be aware of the change flag? Custom > windowing > > > and aggregation logic would certainly need to have that information. > > > 2) How to deal with timestamps? The DataStream API does not give access > > to > > > timestamps. In the Table API / SQL these are exposed as regular > > attributes. > > > How can we ensure that timestamp attributes remain valid (i.e. aligned > > with > > > watermarks) if the output is produced by arbitrary code? > > > There might be more issues of this kind. > > > > > > My main question would be how much would we gain with this proposal > over > > a > > > tight integration of Table API and DataStream API, assuming that batch > > > functionality is moved to DataStream? > > > > > > Best, Fabian > > > > > > [1] http://stratosphere.eu/assets/papers/openingTheBlackBoxes_12.pdf > > > > > > > > > Am Mo., 5. Nov. 
2018 um 02:49 Uhr schrieb Rong Rong < > [hidden email] > > >: > > > > > > > Hi Jincheng, > > > > > > > > Thank you for the proposal! I think being able to define a process / > > > > co-process function in table API definitely opens up a whole new > level > > of > > > > applications using a unified API. > > > > > > > > In addition, as Tzu-Li and Hequn have mentioned, the benefit of > > > > optimization layer of Table API will already bring in additional > > benefit > > > > over directly programming on top of DataStream/DataSet API. I am very > > > > interested an looking forward to seeing the support for more complex > > use > > > > cases, especially iterations. It will enable table API to define much > > > > broader, event-driven use cases such as real-time ML > > prediction/training. > > > > > > > > As Timo mentioned, This will make Table API diverge from the SQL API. > > But > > > > as from my experience Table API was always giving me the impression > to > > > be a > > > > more sophisticated, syntactic-aware way to express relational > > operations. > > > > Looking forward to further discussion and collaborations on the FLIP > > doc. > > > > > > > > -- > > > > Rong > > > > > > > > On Sun, Nov 4, 2018 at 5:22 PM jincheng sun < > [hidden email]> > > > > wrote: > > > > > > > > > Hi tison, > > > > > > > > > > Thanks a lot for your feedback! > > > > > I am very happy to see that community contributors agree to > enhanced > > > the > > > > > TableAPI. This work is a long-term continuous work, we will push it > > in > > > > > stages, we will soon complete the enhanced list of the first > phase, > > we > > > > can > > > > > go deep discussion in google doc. thanks again for joining on the > > very > > > > > important discussion of the Flink Table API. > > > > > > > > > > Thanks, > > > > > Jincheng > > > > > > > > > > Tzu-Li Chen <[hidden email]> 于2018年11月2日周五 下午1:49写道: > > > > > > > > > > > Hi jingchengm > > > > > > > > > > > > Thanks a lot for your proposal! 
I find it is a good start point > for > > > > > > internal optimization works and help Flink to be more > > > > > > user-friendly. > > > > > > > > > > > > AFAIK, DataStream is the most popular API currently that Flink > > > > > > users should describe their logic with detailed logic. > > > > > > From a more internal view the conversion from DataStream to > > > > > > JobGraph is quite mechanically and hard to be optimized. So when > > > > > > users program with DataStream, they have to learn more internals > > > > > > and spend a lot of time to tune for performance. > > > > > > With your proposal, we provide enhanced functionality of Table > API, > > > > > > so that users can describe their job easily on Table aspect. This > > > gives > > > > > > an opportunity to Flink developers to introduce an optimize phase > > > > > > while transforming user program(described by Table API) to > internal > > > > > > representation. > > > > > > > > > > > > Given a user who want to start using Flink with simple ETL, > > > pipelining > > > > > > or analytics, he would find it is most naturally described by > > > SQL/Table > > > > > > API. Further, as mentioned by @hequn, > > > > > > > > > > > > SQL is a widely used language. It follows standards, is a > > > > > > > descriptive language, and is easy to use > > > > > > > > > > > > > > > > > > thus we could expect with the enhancement of SQL/Table API, Flink > > > > > > becomes more friendly to users. > > > > > > > > > > > > Looking forward to the design doc/FLIP! > > > > > > > > > > > > Best, > > > > > > tison. > > > > > > > > > > > > > > > > > > jincheng sun <[hidden email]> 于2018年11月2日周五 上午11:46写道: > > > > > > > > > > > > > Hi Hequn, > > > > > > > Thanks for your feedback! And also thanks for our offline > > > discussion! > > > > > > > You are right, unification of batch and streaming is very > > important > > > > for > > > > > > > flink API. 
> > > > > > > We will provide more detailed design later, Please let me know > if > > > you > > > > > > have > > > > > > > further thoughts or feedback. > > > > > > > > > > > > > > Thanks, > > > > > > > Jincheng > > > > > > > > > > > > > > Hequn Cheng <[hidden email]> 于2018年11月2日周五 上午10:02写道: > > > > > > > > > > > > > > > Hi Jincheng, > > > > > > > > > > > > > > > > Thanks a lot for your proposal. It is very encouraging! > > > > > > > > > > > > > > > > As we all know, SQL is a widely used language. It follows > > > > standards, > > > > > > is a > > > > > > > > descriptive language, and is easy to use. A powerful feature > of > > > SQL > > > > > is > > > > > > > that > > > > > > > > it supports optimization. Users only need to care about the > > logic > > > > of > > > > > > the > > > > > > > > program. The underlying optimizer will help users optimize > the > > > > > > > performance > > > > > > > > of the program. However, in terms of functionality and ease > of > > > use, > > > > > in > > > > > > > some > > > > > > > > scenarios sql will be limited, as described in Jincheng's > > > proposal. > > > > > > > > > > > > > > > > Correspondingly, the DataStream/DataSet api can provide > > powerful > > > > > > > > functionalities. Users can write > > > ProcessFunction/CoProcessFunction > > > > > and > > > > > > > get > > > > > > > > the timer. Compared with SQL, it provides more > functionalities > > > and > > > > > > > > flexibilities. However, it does not support optimization like > > > SQL. > > > > > > > > Meanwhile, DataStream/DataSet api has not been unified which > > > means, > > > > > for > > > > > > > the > > > > > > > > same logic, users need to write a job for each stream and > > batch. > > > > > > > > > > > > > > > > With TableApi, I think we can combine the advantages of both. > > > Users > > > > > can > > > > > > > > easily write relational operations and enjoy optimization. 
At > > the > > > > > same > > > > > > > > time, it supports more functionality and ease of use. Looking > > > > forward > > > > > > to > > > > > > > > the detailed design/FLIP. > > > > > > > > > > > > > > > > Best, > > > > > > > > Hequn > > > > > > > > > > > > > > > > On Fri, Nov 2, 2018 at 9:48 AM Shaoxuan Wang < > > > [hidden email]> > > > > > > > wrote: > > > > > > > > > > > > > > > > > Hi Aljoscha, > > > > > > > > > Glad that you like the proposal. We have completed the > > > prototype > > > > of > > > > > > > most > > > > > > > > > new proposed functionalities. Once collect the feedback > from > > > > > > community, > > > > > > > > we > > > > > > > > > will come up with a concrete FLIP/design doc. > > > > > > > > > > > > > > > > > > Regards, > > > > > > > > > Shaoxuan > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Nov 1, 2018 at 8:12 PM Aljoscha Krettek < > > > > > [hidden email] > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > Hi Jincheng, > > > > > > > > > > > > > > > > > > > > these points sound very good! Are there any concrete > > > proposals > > > > > for > > > > > > > > > > changes? For example a FLIP/design document? > > > > > > > > > > > > > > > > > > > > See here for FLIPs: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals > > > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > > Aljoscha > > > > > > > > > > > > > > > > > > > > > On 1. Nov 2018, at 12:51, jincheng sun < > > > > > [hidden email] > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > *--------I am sorry for the formatting of the email > > > content. 
> > > > I > > > > > > > > reformat > > > > > > > > > > > the **content** as follows-----------* > > > > > > > > > > > > > > > > > > > > > > *Hi ALL,* > > > > > > > > > > > > > > > > > > > > > > With the continuous efforts from the community, the > Flink > > > > > system > > > > > > > has > > > > > > > > > been > > > > > > > > > > > continuously improved, which has attracted more and > more > > > > users. > > > > > > > Flink > > > > > > > > > SQL > > > > > > > > > > > is a canonical, widely used relational query language. > > > > However, > > > > > > > there > > > > > > > > > are > > > > > > > > > > > still some scenarios where Flink SQL failed to meet > user > > > > needs > > > > > in > > > > > > > > terms > > > > > > > > > > of > > > > > > > > > > > functionality and ease of use, such as: > > > > > > > > > > > > > > > > > > > > > > *1. In terms of functionality* > > > > > > > > > > > Iteration, user-defined window, user-defined join, > > > > > > user-defined > > > > > > > > > > > GroupReduce, etc. Users cannot express them with SQL; > > > > > > > > > > > > > > > > > > > > > > *2. In terms of ease of use* > > > > > > > > > > > > > > > > > > > > > > - Map - e.g. “dataStream.map(mapFun)”. Although > > > > > > > > “table.select(udf1(), > > > > > > > > > > > udf2(), udf3()....)” can be used to accomplish the > same > > > > > > > function., > > > > > > > > > > with a > > > > > > > > > > > map() function returning 100 columns, one has to > define > > > or > > > > > call > > > > > > > 100 > > > > > > > > > > UDFs > > > > > > > > > > > when using SQL, which is quite involved. > > > > > > > > > > > - FlatMap - e.g. “dataStrem.flatmap(flatMapFun)”. > > > > Similarly, > > > > > > it > > > > > > > > can > > > > > > > > > be > > > > > > > > > > > implemented with “table.join(udtf).select()”. > However, > > it > > > > is > > > > > > > > obvious > > > > > > > > > > that > > > > > > > > > > > dataStream is easier to use than SQL. 
> > > > > > > > > > > > > > > > > > > > > > Due to the above two reasons, some users have to use > the > > > > > > DataStream > > > > > > > > API > > > > > > > > > > or > > > > > > > > > > > the DataSet API. But when they do that, they lose the > > > > > unification > > > > > > > of > > > > > > > > > > batch > > > > > > > > > > > and streaming. They will also lose the sophisticated > > > > > > optimizations > > > > > > > > such > > > > > > > > > > as > > > > > > > > > > > codegen, aggregate join transpose and multi-stage agg > > from > > > > > Flink > > > > > > > SQL. > > > > > > > > > > > > > > > > > > > > > > We believe that enhancing the functionality and > > > productivity > > > > is > > > > > > > vital > > > > > > > > > for > > > > > > > > > > > the successful adoption of Table API. To this end, > Table > > > API > > > > > > still > > > > > > > > > > > requires more efforts from every contributor in the > > > > community. > > > > > We > > > > > > > see > > > > > > > > > > great > > > > > > > > > > > opportunity in improving our user’s experience from > this > > > > work. > > > > > > Any > > > > > > > > > > feedback > > > > > > > > > > > is welcome. > > > > > > > > > > > > > > > > > > > > > > Regards, > > > > > > > > > > > > > > > > > > > > > > Jincheng > > > > > > > > > > > > > > > > > > > > > > jincheng sun <[hidden email]> 于2018年11月1日周四 > > > > > 下午5:07写道: > > > > > > > > > > > > > > > > > > > > > >> Hi all, > > > > > > > > > > >> > > > > > > > > > > >> With the continuous efforts from the community, the > > Flink > > > > > system > > > > > > > has > > > > > > > > > > been > > > > > > > > > > >> continuously improved, which has attracted more and > more > > > > > users. > > > > > > > > Flink > > > > > > > > > > SQL > > > > > > > > > > >> is a canonical, widely used relational query language. 
> > > > > However, > > > > > > > > there > > > > > > > > > > are > > > > > > > > > > >> still some scenarios where Flink SQL failed to meet > user > > > > needs > > > > > > in > > > > > > > > > terms > > > > > > > > > > of > > > > > > > > > > >> functionality and ease of use, such as: > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > >> - > > > > > > > > > > >> > > > > > > > > > > >> In terms of functionality > > > > > > > > > > >> > > > > > > > > > > >> Iteration, user-defined window, user-defined join, > > > > > user-defined > > > > > > > > > > >> GroupReduce, etc. Users cannot express them with SQL; > > > > > > > > > > >> > > > > > > > > > > >> - > > > > > > > > > > >> > > > > > > > > > > >> In terms of ease of use > > > > > > > > > > >> - > > > > > > > > > > >> > > > > > > > > > > >> Map - e.g. “dataStream.map(mapFun)”. Although > > > > > > > > > “table.select(udf1(), > > > > > > > > > > >> udf2(), udf3()....)” can be used to accomplish > the > > > same > > > > > > > > > function., > > > > > > > > > > with a > > > > > > > > > > >> map() function returning 100 columns, one has to > > > define > > > > > or > > > > > > > call > > > > > > > > > > 100 UDFs > > > > > > > > > > >> when using SQL, which is quite involved. > > > > > > > > > > >> - > > > > > > > > > > >> > > > > > > > > > > >> FlatMap - e.g. “dataStrem.flatmap(flatMapFun)”. > > > > > Similarly, > > > > > > > it > > > > > > > > > can > > > > > > > > > > >> be implemented with “table.join(udtf).select()”. > > > > However, > > > > > > it > > > > > > > is > > > > > > > > > > obvious > > > > > > > > > > >> that datastream is easier to use than SQL. > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > >> Due to the above two reasons, some users have to use > the > > > > > > > DataStream > > > > > > > > > API > > > > > > > > > > or > > > > > > > > > > >> the DataSet API. 
But when they do that, they lose the > > > > > > unification > > > > > > > of > > > > > > > > > > batch > > > > > > > > > > >> and streaming. They will also lose the sophisticated > > > > > > optimizations > > > > > > > > > such > > > > > > > > > > as > > > > > > > > > > >> codegen, aggregate join transpose and multi-stage agg > > > from > > > > > > Flink > > > > > > > > SQL. > > > > > > > > > > >> > > > > > > > > > > >> We believe that enhancing the functionality and > > > productivity > > > > > is > > > > > > > > vital > > > > > > > > > > for > > > > > > > > > > >> the successful adoption of Table API. To this end, > > Table > > > > API > > > > > > > still > > > > > > > > > > >> requires more efforts from every contributor in the > > > > community. > > > > > > We > > > > > > > > see > > > > > > > > > > great > > > > > > > > > > >> opportunity in improving our user’s experience from > this > > > > work. > > > > > > Any > > > > > > > > > > feedback > > > > > > > > > > >> is welcome. > > > > > > > > > > >> > > > > > > > > > > >> Regards, > > > > > > > > > > >> > > > > > > > > > > >> Jincheng > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > |
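The two conveniences discussed above, Jark's Table.drop() and the eval(Row)-style map, can be sketched with a toy, self-contained Table class. The method names follow the discussion in this thread; this is a hypothetical illustration, not the actual Flink Table API:

```python
# Hypothetical sketch of the proposed conveniences (drop, row-based map).
# Method names follow the mailing-list discussion, not any released Flink API.

class Table:
    def __init__(self, columns, rows):
        self.columns = list(columns)
        self.rows = [dict(r) for r in rows]

    def select(self, *columns):
        # Today's workaround: enumerate the 99 columns you want to keep.
        return Table(columns, [{c: r[c] for c in columns} for r in self.rows])

    def drop(self, *columns):
        # Proposed: name only the column(s) you want to remove.
        keep = [c for c in self.columns if c not in columns]
        return self.select(*keep)

    def map(self, fn):
        # Proposed: fn receives the whole row (like eval(Row)) instead of
        # being called as udf(column0, ..., column99).
        mapped = [fn(r) for r in self.rows]
        cols = list(mapped[0].keys()) if mapped else []
        return Table(cols, mapped)

t = Table(["a", "b", "c"], [{"a": 1, "b": 2, "c": 3}])
assert t.drop("b").columns == ["a", "c"]
assert t.drop("b").rows == t.select("a", "c").rows
assert t.map(lambda row: {k: v * 2 for k, v in row.items()}).rows == [{"a": 2, "b": 4, "c": 6}]
```

For a 100-column table, drop("b") names one column instead of the 99 survivors, and map(fn) hands the UDF one Row instead of 100 positional arguments; that is the whole ease-of-use argument in miniature.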
Thanks for the replies, Xiaowei and others!
You are right, I did not consider the batch optimization that would be missing if the DataSet API were ported to extend the DataStream API. By extending the scope of the Table API, we can gain holistic logical and physical optimization, which would be great! Is your plan to move all DataSet API functionality into the Table API? If so, do you envision any batch-related API in DataStream at all, or should this be done by converting a batch table to DataStream? I'm asking because if there were batch features in DataStream, we would need some optimization there as well. I think the proposed separation of Table API (stateless APIs) and DataStream (APIs that expose state handling) is a good idea. On a side note, the DataSet API discouraged state handling in user functions, so porting it to the Table API would be quite "natural". As I said before, I like that we can incrementally extend the Table API. Map and FlatMap functions do not seem too difficult. Reduce, GroupReduce, Combine, GroupCombine, and MapPartition might be more tricky, especially if we want to support retractions. Iterations should be a challenge. I assume that Calcite does not support iterations, so we probably need to split the query/program and optimize the parts separately (IIRC, this is also how Flink's own optimizer handles this). To what extent are you planning to support explicit physical operations like partitioning, sorting, or optimizer hints? I haven't had a look at the design document that you shared yet; I will probably find answers to some of my questions there ;-) Regarding the question of SQL or Table API, I agree that extending the scope of the Table API does not limit the scope for SQL. By adding more operations to the Table API, we can expand it to use cases that are not well served by SQL. As others have said, we'll of course continue to extend and improve Flink's SQL support (within the bounds of the standard). Best, Fabian Am Di., 6. Nov. 
2018 um 10:09 Uhr schrieb jincheng sun < [hidden email]>: > Hi Jark, > Glad to see your feedback! > That's Correct, The proposal is aiming to extend the functionality for > Table API! I like add "drop" to fit the use case you mentioned. Not only > that, if a 100-columns Table. and our UDF needs these 100 columns, we don't > want to define the eval as eval(column0...column99), we prefer to define > eval as eval(Row)。Using it like this: table.select(udf (*)). All we also > need to consider if we put the columns package as a row. In a scenario like > this, we have Classification it as cloumn operation, and list the changes > to the column operation after the map/flatMap/agg/flatAgg phase is > completed. And Currently, Xiaowei has started a threading outlining which > talk about what we are proposing. Please see the detail in the mail thread: > Please see the detail in the mail thread: > > https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB > < > https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB > > > . > > At this stage the Table API Enhancement Outline as follows: > > https://docs.google.com/document/d/1tnpxg31EQz2-MEzSotwFzqatsB4rNLz0I-l_vPa5H4Q/edit?usp=sharing > > Please let we know if you have further thoughts or feedback! > > Thanks, > Jincheng > > > Jark Wu <[hidden email]> 于2018年11月6日周二 下午3:35写道: > > > Hi jingcheng, > > > > Thanks for your proposal. I think it is a helpful enhancement for > TableAPI > > which is a solid step forward for TableAPI. > > It doesn't weaken SQL or DataStream, because the conversion between > > DataStream and Table still works. > > People with advanced cases (e.g. complex and fine-grained state control) > > can go with DataStream, > > but most general cases can stay in TableAPI. This works is aiming to > extend > > the functionality for TableAPI, > > to extend the usage scenario, to help TableAPI becomes a more widely used > > API. 
> > > > For example, someone want to drop one column from a 100-columns Table. > > Currently, we have to convert > > Table to DataStream and use MapFunction to do that, or select the > remaining > > 99 columns using Table.select API. > > But if we support Table.drop() method for TableAPI, it will be a very > > convenient method and let users stay in Table. > > > > Looking forward to the more detailed design and further discussion. > > > > Regards, > > Jark > > > > jincheng sun <[hidden email]> 于2018年11月6日周二 下午1:05写道: > > > > > Hi Rong Rong, > > > > > > Sorry for the late reply, And thanks for your feedback! We will > continue > > > to add more convenience features to the TableAPI, such as map, flatmap, > > > agg, flatagg, iteration etc. And I am very happy that you are > interested > > on > > > this proposal. Due to this is a long-term continuous work, we will push > > it > > > in stages. Currently Xiaowei has started a threading outlining which > > talk > > > about what we are proposing. Please see the detail in the mail thread: > > > > > > > > > https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB > > > < > > > > > > https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB > > > > > > > . > > > > > > The Table API Enhancement Outline as follows: > > > > > > > > > https://docs.google.com/document/d/1tnpxg31EQz2-MEzSotwFzqatsB4rNLz0I-l_vPa5H4Q/edit?usp=sharing > > > > > > Please let we know if you have further thoughts or feedback! > > > > > > Thanks, > > > Jincheng > > > > > > Fabian Hueske <[hidden email]> 于2018年11月5日周一 下午7:03写道: > > > > > > > Hi Jincheng, > > > > > > > > Thanks for this interesting proposal. > > > > I like that we can push this effort forward in a very fine-grained > > > manner, > > > > i.e., incrementally adding more APIs to the Table API. > > > > > > > > However, I also have a few questions / concerns. 
> > > > Today, the Table API is tightly integrated with the DataSet and > > > DataStream > > > > APIs. It is very easy to convert a Table into a DataSet or DataStream > > and > > > > vice versa. This mean it is already easy to combine custom logic an > > > > relational operations. What I like is that several aspects are > clearly > > > > separated like retraction and timestamp handling (see below) + all > > > > libraries on DataStream/DataSet can be easily combined with > relational > > > > operations. > > > > I can see that adding more functionality to the Table API would > remove > > > the > > > > distinction between DataSet and DataStream. However, wouldn't we get > a > > > > similar benefit by extending the DataStream API for proper support > for > > > > bounded streams (as is the long-term goal of Flink)? > > > > I'm also a bit skeptical about the optimization opportunities we > would > > > > gain. Map/FlatMap UDFs are black boxes that cannot be easily removed > > > > without additional information (I did some research on this a few > years > > > ago > > > > [1]). > > > > > > > > Moreover, I think there are a few tricky details that need to be > > resolved > > > > to enable a good integration. > > > > > > > > 1) How to deal with retraction messages? The DataStream API does not > > > have a > > > > notion of retractions. How would a MapFunction or FlatMapFunction > > handle > > > > retraction? Do they need to be aware of the change flag? Custom > > windowing > > > > and aggregation logic would certainly need to have that information. > > > > 2) How to deal with timestamps? The DataStream API does not give > access > > > to > > > > timestamps. In the Table API / SQL these are exposed as regular > > > attributes. > > > > How can we ensure that timestamp attributes remain valid (i.e. > aligned > > > with > > > > watermarks) if the output is produced by arbitrary code? > > > > There might be more issues of this kind. 
> > But from my experience the Table API always gave me the impression of being a more sophisticated, syntax-aware way to express relational operations. Looking forward to further discussion and collaborations on the FLIP doc.
> >
> > --
> > Rong
> >
> > On Sun, Nov 4, 2018 at 5:22 PM jincheng sun <[hidden email]> wrote:
> >
> > > Hi tison,
> > >
> > > Thanks a lot for your feedback! I am very happy to see that community contributors agree to enhance the Table API. This is a long-term, continuous effort that we will push forward in stages. We will soon complete the enhancement list for the first phase, and then we can go into deep discussion in the Google doc. Thanks again for joining this very important discussion of the Flink Table API.
> > >
> > > Thanks,
> > > Jincheng
> > >
> > > Tzu-Li Chen <[hidden email]> 于2018年11月2日周五 下午1:49写道:
> > >
> > > > Hi jincheng,
> > > >
> > > > Thanks a lot for your proposal! I find it a good starting point for internal optimization work, and it helps Flink become more user-friendly.
> > > >
> > > > AFAIK, DataStream is currently the most popular API, with which Flink users describe their logic in detail. From a more internal view, the conversion from DataStream to JobGraph is quite mechanical and hard to optimize. So when users program with DataStream, they have to learn more internals and spend a lot of time tuning for performance. With your proposal, we provide enhanced functionality in the Table API, so that users can describe their jobs easily at the Table level. This gives Flink developers an opportunity to introduce an optimization phase while transforming a user program (described with the Table API) to the internal representation.
> > > >
> > > > Given a user who wants to start using Flink for simple ETL, pipelining or analytics, they would find it most naturally described by the SQL/Table API. Further, as mentioned by @hequn,
> > > >
> > > > > SQL is a widely used language. It follows standards, is a descriptive language, and is easy to use
> > > >
> > > > thus we can expect that with the enhancement of the SQL/Table API, Flink becomes friendlier to users.
> > > >
> > > > Looking forward to the design doc/FLIP!
> > > >
> > > > Best,
> > > > tison.
> > > >
> > > > jincheng sun <[hidden email]> 于2018年11月2日周五 上午11:46写道:
> > > >
> > > > > Hi Hequn,
> > > > > Thanks for your feedback! And also thanks for our offline discussion! You are right, unification of batch and streaming is very important for the Flink APIs. We will provide a more detailed design later. Please let me know if you have further thoughts or feedback.
> > > > >
> > > > > Thanks,
> > > > > Jincheng
> > > > >
> > > > > Hequn Cheng <[hidden email]> 于2018年11月2日周五 上午10:02写道:
> > > > >
> > > > > > Hi Jincheng,
> > > > > >
> > > > > > Thanks a lot for your proposal. It is very encouraging!
> > > > > >
> > > > > > As we all know, SQL is a widely used language. It follows standards, is a descriptive language, and is easy to use. A powerful feature of SQL is that it supports optimization: users only need to care about the logic of the program, and the underlying optimizer helps optimize its performance. However, in terms of functionality and ease of use, SQL is limited in some scenarios, as described in Jincheng's proposal.
> > > > > >
> > > > > > Correspondingly, the DataStream/DataSet APIs provide powerful functionality. Users can write a ProcessFunction/CoProcessFunction and get timers. Compared with SQL, they provide more functionality and flexibility. However, they do not support optimization like SQL does. Meanwhile, the DataStream and DataSet APIs have not been unified, which means that for the same logic users need to write a separate job for stream and for batch.
> > > > > >
> > > > > > With the Table API, I think we can combine the advantages of both: users can easily write relational operations and enjoy optimization, while it also supports more functionality and ease of use. Looking forward to the detailed design/FLIP.
> > > > > >
> > > > > > Best,
> > > > > > Hequn
> > > > > >
> > > > > > On Fri, Nov 2, 2018 at 9:48 AM Shaoxuan Wang <[hidden email]> wrote:
> > > > > >
> > > > > > > Hi Aljoscha,
> > > > > > > Glad that you like the proposal. We have completed the prototype of most of the newly proposed functionality. Once we have collected the feedback from the community, we will come up with a concrete FLIP/design doc.
> > > > > > >
> > > > > > > Regards,
> > > > > > > Shaoxuan
> > > > > > >
> > > > > > > On Thu, Nov 1, 2018 at 8:12 PM Aljoscha Krettek <[hidden email]> wrote:
> > > > > > >
> > > > > > > > Hi Jincheng,
> > > > > > > >
> > > > > > > > these points sound very good! Are there any concrete proposals for changes? For example a FLIP/design document?
> > > > > > > >
> > > > > > > > See here for FLIPs: https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Aljoscha
> > > > > > > >
> > > > > > > > > On 1. Nov 2018, at 12:51, jincheng sun <[hidden email]> wrote:
> > > > > > > > >
> > > > > > > > > *--------I am sorry for the formatting of the email content. I reformat the content as follows-----------*
> > > > > > > > >
> > > > > > > > > *Hi ALL,*
> > > > > > > > >
> > > > > > > > > With the continuous efforts from the community, the Flink system has been continuously improved, which has attracted more and more users. Flink SQL is a canonical, widely used relational query language. However, there are still some scenarios where Flink SQL fails to meet user needs in terms of functionality and ease of use, such as:
> > > > > > > > >
> > > > > > > > > *1. In terms of functionality*
> > > > > > > > > Iteration, user-defined window, user-defined join, user-defined GroupReduce, etc. Users cannot express them with SQL.
> > > > > > > > >
> > > > > > > > > *2. In terms of ease of use*
> > > > > > > > > - Map - e.g. “dataStream.map(mapFun)”. Although “table.select(udf1(), udf2(), udf3()....)” can be used to accomplish the same function, with a map() function returning 100 columns one has to define or call 100 UDFs when using SQL, which is quite involved.
> > > > > > > > > - FlatMap - e.g. “dataStream.flatMap(flatMapFun)”. Similarly, it can be implemented with “table.join(udtf).select()”. However, it is obvious that DataStream is easier to use than SQL.
> > > > > > > > >
> > > > > > > > > Due to the above two reasons, some users have to use the DataStream API or the DataSet API. But when they do that, they lose the unification of batch and streaming. They also lose the sophisticated optimizations such as codegen, aggregate join transpose and multi-stage agg from Flink SQL.
> > > > > > > > >
> > > > > > > > > We believe that enhancing the functionality and productivity is vital for the successful adoption of the Table API. To this end, the Table API still requires more effort from every contributor in the community. We see great opportunity in improving our users’ experience from this work. Any feedback is welcome.
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Jincheng
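To make the ease-of-use argument in the proposal concrete: the gap between one row-level map function and 100 per-column UDFs can be sketched in plain Java (this is an illustration only, not Flink Table API code; the class and method names here are made up for the example).

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class MapVsSelectSketch {

    // One row-level function produces all output columns at once,
    // the way dataStream.map(mapFun) does.
    static List<String> mapRow(List<String> row) {
        return row.stream().map(String::toUpperCase).collect(Collectors.toList());
    }

    // The select-style alternative needs one scalar function per output
    // column; with 100 output columns, that means defining or calling
    // 100 of these.
    static Function<List<String>, String> columnUdf(int i) {
        return row -> row.get(i).toUpperCase();
    }

    public static void main(String[] args) {
        List<String> row = Arrays.asList("a", "b", "c");

        List<String> viaMap = mapRow(row);

        List<String> viaSelect = IntStream.range(0, row.size())
                .mapToObj(i -> columnUdf(i).apply(row))
                .collect(Collectors.toList());

        // Both routes produce the same result; only the amount of
        // user code differs, which is the point of the proposal.
        if (!viaMap.equals(viaSelect)) {
            throw new AssertionError("results differ");
        }
        System.out.println(viaMap);
    }
}
```

The larger the output row, the more the per-column style costs: the map version stays one function regardless of column count.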
Hi Fabian,
I totally agree that we should improve the Table API incrementally. We are not suggesting anything drastic such as replacing the DataSet API yet. We should first see how much we can achieve by extending the Table API cleanly; by then we should see whether there are any natural boundaries on how far we can get. Hopefully it will be much easier for the community to reach consensus at that point.

As you pointed out, the APIs in DataSet have mixed flavors: some of them are logical while others are very physical (e.g., MapPartition). We should really think about what the use cases for such physical APIs are. Is it possible that their existence points to limitations in our current optimizations? If we did better in our optimizations, could we use more logical APIs to achieve the same task? We did some investigation of the "hard" DataSet APIs before; we will post some of the lessons we learned and hope to get feedback from you.

Iteration is a very complex and interesting topic. I think that we can live with optimizations stopping at the boundary of iterations (at least initially).

Regards,
Xiaowei

On Tue, Nov 6, 2018 at 6:21 PM Fabian Hueske <[hidden email]> wrote:

> Thanks for the replies Xiaowei and others!
>
> You are right, I did not consider the batch optimizations that would be missing if the DataSet API were ported to extend the DataStream API. By extending the scope of the Table API, we can gain holistic logical & physical optimization, which would be great!
> Is your plan to move all DataSet API functionality into the Table API? If so, do you envision any batch-related API in DataStream at all, or should this be done by converting a batch table to a DataStream? I'm asking because if there were batch features in DataStream, we would need some optimization there as well.
>
> I think the proposed separation of Table API (stateless APIs) and DataStream (APIs that expose state handling) is a good idea. On a side note, the DataSet API discouraged state handling in user functions, so porting this to the Table API would be quite "natural".
>
> As I said before, I like that we can incrementally extend the Table API. Map and FlatMap functions do not seem too difficult. Reduce, GroupReduce, Combine, GroupCombine, MapPartition might be more tricky, especially if we want to support retractions. Iterations will be a challenge. I assume that Calcite does not support iterations, so we probably need to split query/program and optimize the parts separately (IIRC, this is also how Flink's own optimizer handles this). To what extent are you planning to support explicit physical operations like partitioning, sorting, or optimizer hints?
>
> I haven't had a look at the design document that you shared yet. Probably I will find answers to some of my questions there ;-)
>
> Regarding the question of SQL or Table API, I agree that extending the scope of the Table API does not limit the scope of SQL. By adding more operations to the Table API we can expand it to use cases that are not well served by SQL. As others have said, we'll of course continue to extend and improve Flink's SQL support (within the bounds of the standard).
>
> Best, Fabian
>
> Am Di., 6. Nov. 2018 um 10:09 Uhr schrieb jincheng sun <[hidden email]>:
>
> > Hi Jark,
> > Glad to see your feedback! That's correct, the proposal is aiming to extend the functionality of the Table API! I like adding "drop" to fit the use case you mentioned. Not only that: given a 100-column Table whose UDF needs all 100 columns, we don't want to define the eval method as eval(column0...column99); we prefer to define eval as eval(Row), using it like this: table.select(udf(*)). We also need to consider whether to package the columns as a Row. In a scenario like this, we classify it as a column operation, and we list the changes to the column operations after the map/flatMap/agg/flatAgg phase is completed. Currently, Xiaowei has started a thread outlining what we are proposing. Please see the details in the mail thread:
> > https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB
> >
> > At this stage the Table API Enhancement Outline is as follows:
> > https://docs.google.com/document/d/1tnpxg31EQz2-MEzSotwFzqatsB4rNLz0I-l_vPa5H4Q/edit?usp=sharing
> >
> > Please let us know if you have further thoughts or feedback!
> >
> > Thanks,
> > Jincheng
> >
> > Jark Wu <[hidden email]> 于2018年11月6日周二 下午3:35写道:
> >
> > > Hi jincheng,
> > >
> > > Thanks for your proposal. I think it is a helpful enhancement for the Table API and a solid step forward. It doesn't weaken SQL or DataStream, because the conversion between DataStream and Table still works. People with advanced cases (e.g. complex and fine-grained state control) can go with DataStream, but most general cases can stay in the Table API. This work is aiming to extend the functionality of the Table API, to extend its usage scenarios, and to help the Table API become a more widely used API.
> > >
> > > For example, someone wants to drop one column from a 100-column Table. Currently, we have to convert the Table to a DataStream and use a MapFunction to do that, or select the remaining 99 columns using the Table.select API. But if we supported a Table.drop() method in the Table API, it would be a very convenient method that lets users stay in Table.
> > >
> > > Looking forward to the more detailed design and further discussion.
> > > Regards,
> > > Jark
> > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> Due to the above two reasons, some users have to > use > > > the > > > > > > > > > DataStream > > > > > > > > > > > API > > > > > > > > > > > > or > > > > > > > > > > > > >> the DataSet API. But when they do that, they lose > > the > > > > > > > > unification > > > > > > > > > of > > > > > > > > > > > > batch > > > > > > > > > > > > >> and streaming. They will also lose the > sophisticated > > > > > > > > optimizations > > > > > > > > > > > such > > > > > > > > > > > > as > > > > > > > > > > > > >> codegen, aggregate join transpose and multi-stage > > agg > > > > > from > > > > > > > > Flink > > > > > > > > > > SQL. > > > > > > > > > > > > >> > > > > > > > > > > > > >> We believe that enhancing the functionality and > > > > > productivity > > > > > > > is > > > > > > > > > > vital > > > > > > > > > > > > for > > > > > > > > > > > > >> the successful adoption of Table API. To this end, > > > > Table > > > > > > API > > > > > > > > > still > > > > > > > > > > > > >> requires more efforts from every contributor in > the > > > > > > community. > > > > > > > > We > > > > > > > > > > see > > > > > > > > > > > > great > > > > > > > > > > > > >> opportunity in improving our user’s experience > from > > > this > > > > > > work. > > > > > > > > Any > > > > > > > > > > > > feedback > > > > > > > > > > > > >> is welcome. > > > > > > > > > > > > >> > > > > > > > > > > > > >> Regards, > > > > > > > > > > > > >> > > > > > > > > > > > > >> Jincheng > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > |
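As a concrete illustration of the ease-of-use argument quoted above ("with a map() function returning 100 columns, one has to define or call 100 UDFs when using SQL"), here is a small self-contained Java sketch. It uses plain JDK types only — a "row" is just a `List<Object>`, and `select`/`map` are simplified stand-ins for the API shapes under discussion, not actual Flink Table API code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Simplified stand-ins for the API shapes under discussion -- a row is just a
// List<Object>. This is NOT Flink code; it only illustrates the difference.
class MapVsSelect {

    // "select" style: one scalar UDF per output column, applied independently.
    static List<Object> select(List<Object> row, List<Function<List<Object>, Object>> udfs) {
        List<Object> out = new ArrayList<>();
        for (Function<List<Object>, Object> udf : udfs) {
            out.add(udf.apply(row));
        }
        return out;
    }

    // "map" style: a single function produces the whole output row at once.
    static List<Object> map(List<Object> row, Function<List<Object>, List<Object>> mapFun) {
        return mapFun.apply(row);
    }

    public static void main(String[] args) {
        List<Object> row = Arrays.asList(1, 2);

        // select: a separate UDF per output column -- for 100 columns, 100 of these.
        List<Object> viaSelect = select(row, Arrays.asList(
                r -> (int) r.get(0) + (int) r.get(1),
                r -> (int) r.get(0) * (int) r.get(1)));

        // map: one function returning all output columns in a single row object.
        List<Object> viaMap = map(row, r -> Arrays.asList(
                (int) r.get(0) + (int) r.get(1),
                (int) r.get(0) * (int) r.get(1)));

        System.out.println(viaSelect.equals(viaMap)); // prints: true
    }
}
```

Both styles compute the same result, but with 100 output columns the select style needs 100 separate scalar functions while the map style still needs exactly one — which is the gap the proposal aims to close.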
In reply to this post by Fabian Hueske-2
Hi Fabian,
Thank you for your deep thoughts in this regard; I think most of the questions you mentioned are well worth in-depth discussion! I want to share my thoughts on the following questions:

1. Do we need to move all DataSet API functionality into the Table API?
I think most of the DataSet functionality should be added to the Table API, such as map, flatMap, groupReduce, etc., because these are very easy for users to work with.

2. Do we support explicit physical operations like partitioning, sorting, or optimizer hints?
I think we should not add physical operations such as sortPartition, partitionCustom, etc. From my point of view, those physical operations exist for optimization, which can be solved by hints (I think we should add a hints feature to both the Table API and SQL).

3. Do we want to support retractions in iteration?
I think supporting iteration is a very complicated feature. I am not sure, but I think iteration could be implemented along the lines of the current batch mode, with retraction temporarily unsupported, assuming that the training data will not be updated within the current iteration; updated data would then be used in the next iteration. So I think we should discuss this in depth in a new thread.

BTW, I see that you have already left very useful comments in the Google doc:
https://docs.google.com/document/d/1tnpxg31EQz2-MEzSotwFzqatsB4rNLz0I-l_vPa5H4Q/edit#

Thanks again for both your mail feedback and doc comments!

Best,
Jincheng

Fabian Hueske <[hidden email]> 于2018年11月6日周二 下午6:21写道: > Thanks for the replies Xiaowei and others! > > You are right, I did not consider the batch optimization that would be > missing if the DataSet API would be ported to extend the DataStream API. > By extending the scope of the Table API, we can gain a holistic logical & > physical optimization which would be great! > Is your plan to move all DataSet API functionality into the Table API?
> If so, do you envision any batch-related API in DataStream at all or should > this be done by converting a batch table to DataStream? I'm asking because > if there would be batch features in DataStream, we would need some > optimization there as well. > > I think the proposed separation of Table API (stateless APIs) and > DataStream (APIs that expose state handling) is a good idea. > On a side note, the DataSet API discouraged state handing in user function, > so porting this Table API would be quite "natural". > > As I said before, I like that we can incrementally extend the Table API. > Map and FlatMap functions do not seem too difficult. > Reduce, GroupReduce, Combine, GroupCombine, MapPartition might be more > tricky, esp. if we want to support retractions. > Iterations should be a challenge. I assume that Calcite does not support > iterations, so we probably need to split query / program and optimize parts > separately (IIRC, this is also how Flink's own optimizer handles this). > To what extend are you planning to support explicit physical operations > like partitioning, sorting or optimizer hints? > > I haven't had a look in the design document that you shared. Probably, I > find answers to some of my questions there ;-) > > Regarding the question of SQL or Table API, I agree that extending the > scope of the Table API does not limit the scope for SQL. > By adding more operations to the Table API we can expand it to use case > that are not well-served by SQL. > As others have said, we'll of course continue to extend and improve Flink's > SQL support (within the bounds of the standard). > > Best, Fabian > > Am Di., 6. Nov. 2018 um 10:09 Uhr schrieb jincheng sun < > [hidden email]>: > > > Hi Jark, > > Glad to see your feedback! > > That's Correct, The proposal is aiming to extend the functionality for > > Table API! I like add "drop" to fit the use case you mentioned. Not only > > that, if a 100-columns Table. 
and our UDF needs these 100 columns, we > don't > > want to define the eval as eval(column0...column99), we prefer to define > > eval as eval(Row)。Using it like this: table.select(udf (*)). All we also > > need to consider if we put the columns package as a row. In a scenario > like > > this, we have Classification it as cloumn operation, and list the > changes > > to the column operation after the map/flatMap/agg/flatAgg phase is > > completed. And Currently, Xiaowei has started a threading outlining > which > > talk about what we are proposing. Please see the detail in the mail > thread: > > Please see the detail in the mail thread: > > > > > https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB > > < > > > https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB > > > > > . > > > > At this stage the Table API Enhancement Outline as follows: > > > > > https://docs.google.com/document/d/1tnpxg31EQz2-MEzSotwFzqatsB4rNLz0I-l_vPa5H4Q/edit?usp=sharing > > > > Please let we know if you have further thoughts or feedback! > > > > Thanks, > > Jincheng > > > > > > Jark Wu <[hidden email]> 于2018年11月6日周二 下午3:35写道: > > > > > Hi jingcheng, > > > > > > Thanks for your proposal. I think it is a helpful enhancement for > > TableAPI > > > which is a solid step forward for TableAPI. > > > It doesn't weaken SQL or DataStream, because the conversion between > > > DataStream and Table still works. > > > People with advanced cases (e.g. complex and fine-grained state > control) > > > can go with DataStream, > > > but most general cases can stay in TableAPI. This works is aiming to > > extend > > > the functionality for TableAPI, > > > to extend the usage scenario, to help TableAPI becomes a more widely > used > > > API. > > > > > > For example, someone want to drop one column from a 100-columns Table. 
> > > Currently, we have to convert > > > Table to DataStream and use MapFunction to do that, or select the > > remaining > > > 99 columns using Table.select API. > > > But if we support Table.drop() method for TableAPI, it will be a very > > > convenient method and let users stay in Table. > > > > > > Looking forward to the more detailed design and further discussion. > > > > > > Regards, > > > Jark > > > > > > jincheng sun <[hidden email]> 于2018年11月6日周二 下午1:05写道: > > > > > > > Hi Rong Rong, > > > > > > > > Sorry for the late reply, And thanks for your feedback! We will > > continue > > > > to add more convenience features to the TableAPI, such as map, > flatmap, > > > > agg, flatagg, iteration etc. And I am very happy that you are > > interested > > > on > > > > this proposal. Due to this is a long-term continuous work, we will > push > > > it > > > > in stages. Currently Xiaowei has started a threading outlining > which > > > talk > > > > about what we are proposing. Please see the detail in the mail > thread: > > > > > > > > > > > > > > https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB > > > > < > > > > > > > > > > https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB > > > > > > > > > . > > > > > > > > The Table API Enhancement Outline as follows: > > > > > > > > > > > > > > https://docs.google.com/document/d/1tnpxg31EQz2-MEzSotwFzqatsB4rNLz0I-l_vPa5H4Q/edit?usp=sharing > > > > > > > > Please let we know if you have further thoughts or feedback! > > > > > > > > Thanks, > > > > Jincheng > > > > > > > > Fabian Hueske <[hidden email]> 于2018年11月5日周一 下午7:03写道: > > > > > > > > > Hi Jincheng, > > > > > > > > > > Thanks for this interesting proposal. > > > > > I like that we can push this effort forward in a very fine-grained > > > > manner, > > > > > i.e., incrementally adding more APIs to the Table API. > > > > > > > > > > However, I also have a few questions / concerns. 
> > > > > Today, the Table API is tightly integrated with the DataSet and > > > > DataStream > > > > > APIs. It is very easy to convert a Table into a DataSet or > DataStream > > > and > > > > > vice versa. This mean it is already easy to combine custom logic an > > > > > relational operations. What I like is that several aspects are > > clearly > > > > > separated like retraction and timestamp handling (see below) + all > > > > > libraries on DataStream/DataSet can be easily combined with > > relational > > > > > operations. > > > > > I can see that adding more functionality to the Table API would > > remove > > > > the > > > > > distinction between DataSet and DataStream. However, wouldn't we > get > > a > > > > > similar benefit by extending the DataStream API for proper support > > for > > > > > bounded streams (as is the long-term goal of Flink)? > > > > > I'm also a bit skeptical about the optimization opportunities we > > would > > > > > gain. Map/FlatMap UDFs are black boxes that cannot be easily > removed > > > > > without additional information (I did some research on this a few > > years > > > > ago > > > > > [1]). > > > > > > > > > > Moreover, I think there are a few tricky details that need to be > > > resolved > > > > > to enable a good integration. > > > > > > > > > > 1) How to deal with retraction messages? The DataStream API does > not > > > > have a > > > > > notion of retractions. How would a MapFunction or FlatMapFunction > > > handle > > > > > retraction? Do they need to be aware of the change flag? Custom > > > windowing > > > > > and aggregation logic would certainly need to have that > information. > > > > > 2) How to deal with timestamps? The DataStream API does not give > > access > > > > to > > > > > timestamps. In the Table API / SQL these are exposed as regular > > > > attributes. > > > > > How can we ensure that timestamp attributes remain valid (i.e. 
> > aligned > > > > with > > > > > watermarks) if the output is produced by arbitrary code? > > > > > There might be more issues of this kind. > > > > > > > > > > My main question would be how much would we gain with this proposal > > > over > > > > a > > > > > tight integration of Table API and DataStream API, assuming that > > batch > > > > > functionality is moved to DataStream? > > > > > > > > > > Best, Fabian > > > > > > > > > > [1] > http://stratosphere.eu/assets/papers/openingTheBlackBoxes_12.pdf > > > > > > > > > > > > > > > Am Mo., 5. Nov. 2018 um 02:49 Uhr schrieb Rong Rong < > > > [hidden email] > > > > >: > > > > > > > > > > > Hi Jincheng, > > > > > > > > > > > > Thank you for the proposal! I think being able to define a > process > > / > > > > > > co-process function in table API definitely opens up a whole new > > > level > > > > of > > > > > > applications using a unified API. > > > > > > > > > > > > In addition, as Tzu-Li and Hequn have mentioned, the benefit of > > > > > > optimization layer of Table API will already bring in additional > > > > benefit > > > > > > over directly programming on top of DataStream/DataSet API. I am > > very > > > > > > interested an looking forward to seeing the support for more > > complex > > > > use > > > > > > cases, especially iterations. It will enable table API to define > > much > > > > > > broader, event-driven use cases such as real-time ML > > > > prediction/training. > > > > > > > > > > > > As Timo mentioned, This will make Table API diverge from the SQL > > API. > > > > But > > > > > > as from my experience Table API was always giving me the > impression > > > to > > > > > be a > > > > > > more sophisticated, syntactic-aware way to express relational > > > > operations. > > > > > > Looking forward to further discussion and collaborations on the > > FLIP > > > > doc. 
> > > > > > > > > > > > -- > > > > > > Rong > > > > > > > > > > > > On Sun, Nov 4, 2018 at 5:22 PM jincheng sun < > > > [hidden email]> > > > > > > wrote: > > > > > > > > > > > > > Hi tison, > > > > > > > > > > > > > > Thanks a lot for your feedback! > > > > > > > I am very happy to see that community contributors agree to > > > enhanced > > > > > the > > > > > > > TableAPI. This work is a long-term continuous work, we will > push > > it > > > > in > > > > > > > stages, we will soon complete the enhanced list of the first > > > phase, > > > > we > > > > > > can > > > > > > > go deep discussion in google doc. thanks again for joining on > > the > > > > very > > > > > > > important discussion of the Flink Table API. > > > > > > > > > > > > > > Thanks, > > > > > > > Jincheng > > > > > > > > > > > > > > Tzu-Li Chen <[hidden email]> 于2018年11月2日周五 下午1:49写道: > > > > > > > > > > > > > > > Hi jingchengm > > > > > > > > > > > > > > > > Thanks a lot for your proposal! I find it is a good start > point > > > for > > > > > > > > internal optimization works and help Flink to be more > > > > > > > > user-friendly. > > > > > > > > > > > > > > > > AFAIK, DataStream is the most popular API currently that > Flink > > > > > > > > users should describe their logic with detailed logic. > > > > > > > > From a more internal view the conversion from DataStream to > > > > > > > > JobGraph is quite mechanically and hard to be optimized. So > > when > > > > > > > > users program with DataStream, they have to learn more > > internals > > > > > > > > and spend a lot of time to tune for performance. > > > > > > > > With your proposal, we provide enhanced functionality of > Table > > > API, > > > > > > > > so that users can describe their job easily on Table aspect. > > This > > > > > gives > > > > > > > > an opportunity to Flink developers to introduce an optimize > > phase > > > > > > > > while transforming user program(described by Table API) to > > > internal > > > > > > > > representation. 
> > > > > > > > > > > > > > > > Given a user who want to start using Flink with simple ETL, > > > > > pipelining > > > > > > > > or analytics, he would find it is most naturally described by > > > > > SQL/Table > > > > > > > > API. Further, as mentioned by @hequn, > > > > > > > > > > > > > > > > SQL is a widely used language. It follows standards, is a > > > > > > > > > descriptive language, and is easy to use > > > > > > > > > > > > > > > > > > > > > > > > thus we could expect with the enhancement of SQL/Table API, > > Flink > > > > > > > > becomes more friendly to users. > > > > > > > > > > > > > > > > Looking forward to the design doc/FLIP! > > > > > > > > > > > > > > > > Best, > > > > > > > > tison. > > > > > > > > > > > > > > > > > > > > > > > > jincheng sun <[hidden email]> 于2018年11月2日周五 > > 上午11:46写道: > > > > > > > > > > > > > > > > > Hi Hequn, > > > > > > > > > Thanks for your feedback! And also thanks for our offline > > > > > discussion! > > > > > > > > > You are right, unification of batch and streaming is very > > > > important > > > > > > for > > > > > > > > > flink API. > > > > > > > > > We will provide more detailed design later, Please let me > > know > > > if > > > > > you > > > > > > > > have > > > > > > > > > further thoughts or feedback. > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > Jincheng > > > > > > > > > > > > > > > > > > Hequn Cheng <[hidden email]> 于2018年11月2日周五 > 上午10:02写道: > > > > > > > > > > > > > > > > > > > Hi Jincheng, > > > > > > > > > > > > > > > > > > > > Thanks a lot for your proposal. It is very encouraging! > > > > > > > > > > > > > > > > > > > > As we all know, SQL is a widely used language. It follows > > > > > > standards, > > > > > > > > is a > > > > > > > > > > descriptive language, and is easy to use. A powerful > > feature > > > of > > > > > SQL > > > > > > > is > > > > > > > > > that > > > > > > > > > > it supports optimization. 
Users only need to care about > the > > > > logic > > > > > > of > > > > > > > > the > > > > > > > > > > program. The underlying optimizer will help users > optimize > > > the > > > > > > > > > performance > > > > > > > > > > of the program. However, in terms of functionality and > ease > > > of > > > > > use, > > > > > > > in > > > > > > > > > some > > > > > > > > > > scenarios sql will be limited, as described in Jincheng's > > > > > proposal. > > > > > > > > > > > > > > > > > > > > Correspondingly, the DataStream/DataSet api can provide > > > > powerful > > > > > > > > > > functionalities. Users can write > > > > > ProcessFunction/CoProcessFunction > > > > > > > and > > > > > > > > > get > > > > > > > > > > the timer. Compared with SQL, it provides more > > > functionalities > > > > > and > > > > > > > > > > flexibilities. However, it does not support optimization > > like > > > > > SQL. > > > > > > > > > > Meanwhile, DataStream/DataSet api has not been unified > > which > > > > > means, > > > > > > > for > > > > > > > > > the > > > > > > > > > > same logic, users need to write a job for each stream and > > > > batch. > > > > > > > > > > > > > > > > > > > > With TableApi, I think we can combine the advantages of > > both. > > > > > Users > > > > > > > can > > > > > > > > > > easily write relational operations and enjoy > optimization. > > At > > > > the > > > > > > > same > > > > > > > > > > time, it supports more functionality and ease of use. > > Looking > > > > > > forward > > > > > > > > to > > > > > > > > > > the detailed design/FLIP. > > > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > > Hequn > > > > > > > > > > > > > > > > > > > > On Fri, Nov 2, 2018 at 9:48 AM Shaoxuan Wang < > > > > > [hidden email]> > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > Hi Aljoscha, > > > > > > > > > > > Glad that you like the proposal. 
We have completed the > > > > > prototype > > > > > > of > > > > > > > > > most > > > > > > > > > > > new proposed functionalities. Once collect the feedback > > > from > > > > > > > > community, > > > > > > > > > > we > > > > > > > > > > > will come up with a concrete FLIP/design doc. > > > > > > > > > > > > > > > > > > > > > > Regards, > > > > > > > > > > > Shaoxuan > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Nov 1, 2018 at 8:12 PM Aljoscha Krettek < > > > > > > > [hidden email] > > > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > Hi Jincheng, > > > > > > > > > > > > > > > > > > > > > > > > these points sound very good! Are there any concrete > > > > > proposals > > > > > > > for > > > > > > > > > > > > changes? For example a FLIP/design document? > > > > > > > > > > > > > > > > > > > > > > > > See here for FLIPs: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals > > > > > > > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > > > > Aljoscha > > > > > > > > > > > > > > > > > > > > > > > > > On 1. Nov 2018, at 12:51, jincheng sun < > > > > > > > [hidden email] > > > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > *--------I am sorry for the formatting of the email > > > > > content. > > > > > > I > > > > > > > > > > reformat > > > > > > > > > > > > > the **content** as follows-----------* > > > > > > > > > > > > > > > > > > > > > > > > > > *Hi ALL,* > > > > > > > > > > > > > > > > > > > > > > > > > > With the continuous efforts from the community, the > > > Flink > > > > > > > system > > > > > > > > > has > > > > > > > > > > > been > > > > > > > > > > > > > continuously improved, which has attracted more and > > > more > > > > > > users. 
> > > > > > > > > Flink > > > > > > > > > > > SQL > > > > > > > > > > > > > is a canonical, widely used relational query > > language. > > > > > > However, > > > > > > > > > there > > > > > > > > > > > are > > > > > > > > > > > > > still some scenarios where Flink SQL failed to meet > > > user > > > > > > needs > > > > > > > in > > > > > > > > > > terms > > > > > > > > > > > > of > > > > > > > > > > > > > functionality and ease of use, such as: > > > > > > > > > > > > > > > > > > > > > > > > > > *1. In terms of functionality* > > > > > > > > > > > > > Iteration, user-defined window, user-defined > join, > > > > > > > > user-defined > > > > > > > > > > > > > GroupReduce, etc. Users cannot express them with > SQL; > > > > > > > > > > > > > > > > > > > > > > > > > > *2. In terms of ease of use* > > > > > > > > > > > > > > > > > > > > > > > > > > - Map - e.g. “dataStream.map(mapFun)”. Although > > > > > > > > > > “table.select(udf1(), > > > > > > > > > > > > > udf2(), udf3()....)” can be used to accomplish > the > > > same > > > > > > > > > function., > > > > > > > > > > > > with a > > > > > > > > > > > > > map() function returning 100 columns, one has to > > > define > > > > > or > > > > > > > call > > > > > > > > > 100 > > > > > > > > > > > > UDFs > > > > > > > > > > > > > when using SQL, which is quite involved. > > > > > > > > > > > > > - FlatMap - e.g. > “dataStrem.flatmap(flatMapFun)”. > > > > > > Similarly, > > > > > > > > it > > > > > > > > > > can > > > > > > > > > > > be > > > > > > > > > > > > > implemented with “table.join(udtf).select()”. > > > However, > > > > it > > > > > > is > > > > > > > > > > obvious > > > > > > > > > > > > that > > > > > > > > > > > > > dataStream is easier to use than SQL. > > > > > > > > > > > > > > > > > > > > > > > > > > Due to the above two reasons, some users have to > use > > > the > > > > > > > > DataStream > > > > > > > > > > API > > > > > > > > > > > > or > > > > > > > > > > > > > the DataSet API. 
But when they do that, they lose > the > > > > > > > unification > > > > > > > > > of > > > > > > > > > > > > batch > > > > > > > > > > > > > and streaming. They will also lose the > sophisticated > > > > > > > > optimizations > > > > > > > > > > such > > > > > > > > > > > > as > > > > > > > > > > > > > codegen, aggregate join transpose and multi-stage > agg > > > > from > > > > > > > Flink > > > > > > > > > SQL. > > > > > > > > > > > > > > > > > > > > > > > > > > We believe that enhancing the functionality and > > > > > productivity > > > > > > is > > > > > > > > > vital > > > > > > > > > > > for > > > > > > > > > > > > > the successful adoption of Table API. To this end, > > > Table > > > > > API > > > > > > > > still > > > > > > > > > > > > > requires more efforts from every contributor in the > > > > > > community. > > > > > > > We > > > > > > > > > see > > > > > > > > > > > > great > > > > > > > > > > > > > opportunity in improving our user’s experience from > > > this > > > > > > work. > > > > > > > > Any > > > > > > > > > > > > feedback > > > > > > > > > > > > > is welcome. > > > > > > > > > > > > > > > > > > > > > > > > > > Regards, > > > > > > > > > > > > > > > > > > > > > > > > > > Jincheng > > > > > > > > > > > > > > > > > > > > > > > > > > jincheng sun <[hidden email]> > > 于2018年11月1日周四 > > > > > > > 下午5:07写道: > > > > > > > > > > > > > > > > > > > > > > > > > >> Hi all, > > > > > > > > > > > > >> > > > > > > > > > > > > >> With the continuous efforts from the community, > the > > > > Flink > > > > > > > system > > > > > > > > > has > > > > > > > > > > > > been > > > > > > > > > > > > >> continuously improved, which has attracted more > and > > > more > > > > > > > users. > > > > > > > > > > Flink > > > > > > > > > > > > SQL > > > > > > > > > > > > >> is a canonical, widely used relational query > > language. 
> > > > > > > However, > > > > > > > > > > there > > > > > > > > > > > > are > > > > > > > > > > > > >> still some scenarios where Flink SQL failed to > meet > > > user > > > > > > needs > > > > > > > > in > > > > > > > > > > > terms > > > > > > > > > > > > of > > > > > > > > > > > > >> functionality and ease of use, such as: > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> - > > > > > > > > > > > > >> > > > > > > > > > > > > >> In terms of functionality > > > > > > > > > > > > >> > > > > > > > > > > > > >> Iteration, user-defined window, user-defined join, > > > > > > > user-defined > > > > > > > > > > > > >> GroupReduce, etc. Users cannot express them with > > SQL; > > > > > > > > > > > > >> > > > > > > > > > > > > >> - > > > > > > > > > > > > >> > > > > > > > > > > > > >> In terms of ease of use > > > > > > > > > > > > >> - > > > > > > > > > > > > >> > > > > > > > > > > > > >> Map - e.g. “dataStream.map(mapFun)”. Although > > > > > > > > > > > “table.select(udf1(), > > > > > > > > > > > > >> udf2(), udf3()....)” can be used to > accomplish > > > the > > > > > same > > > > > > > > > > > function., > > > > > > > > > > > > with a > > > > > > > > > > > > >> map() function returning 100 columns, one has > > to > > > > > define > > > > > > > or > > > > > > > > > call > > > > > > > > > > > > 100 UDFs > > > > > > > > > > > > >> when using SQL, which is quite involved. > > > > > > > > > > > > >> - > > > > > > > > > > > > >> > > > > > > > > > > > > >> FlatMap - e.g. > > “dataStrem.flatmap(flatMapFun)”. > > > > > > > Similarly, > > > > > > > > > it > > > > > > > > > > > can > > > > > > > > > > > > >> be implemented with > > “table.join(udtf).select()”. > > > > > > However, > > > > > > > > it > > > > > > > > > is > > > > > > > > > > > > obvious > > > > > > > > > > > > >> that datastream is easier to use than SQL. 
> > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> Due to the above two reasons, some users have to > use > > > the > > > > > > > > > DataStream > > > > > > > > > > > API > > > > > > > > > > > > or > > > > > > > > > > > > >> the DataSet API. But when they do that, they lose > > the > > > > > > > > unification > > > > > > > > > of > > > > > > > > > > > > batch > > > > > > > > > > > > >> and streaming. They will also lose the > sophisticated > > > > > > > > optimizations > > > > > > > > > > > such > > > > > > > > > > > > as > > > > > > > > > > > > >> codegen, aggregate join transpose and multi-stage > > agg > > > > > from > > > > > > > > Flink > > > > > > > > > > SQL. > > > > > > > > > > > > >> > > > > > > > > > > > > >> We believe that enhancing the functionality and > > > > > productivity > > > > > > > is > > > > > > > > > > vital > > > > > > > > > > > > for > > > > > > > > > > > > >> the successful adoption of Table API. To this end, > > > > Table > > > > > > API > > > > > > > > > still > > > > > > > > > > > > >> requires more efforts from every contributor in > the > > > > > > community. > > > > > > > > We > > > > > > > > > > see > > > > > > > > > > > > great > > > > > > > > > > > > >> opportunity in improving our user’s experience > from > > > this > > > > > > work. > > > > > > > > Any > > > > > > > > > > > > feedback > > > > > > > > > > > > >> is welcome. > > > > > > > > > > > > >> > > > > > > > > > > > > >> Regards, > > > > > > > > > > > > >> > > > > > > > > > > > > >> Jincheng > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > |
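The `Table.drop()` convenience discussed in the message above (removing one column from a 100-column table without selecting the other 99) can be sketched as follows. This is a plain-Java illustration of the intended semantics — a "row" here is an ordered column-name-to-value map, and `drop` is the proposed convenience, not an existing Flink API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch of the proposed drop-column convenience; a "row" here is
// an ordered column-name -> value map. This is NOT Flink's Table API.
class DropSketch {

    // drop: name only the columns to remove, instead of selecting the rest.
    static Map<String, Object> drop(Map<String, Object> row, String... columns) {
        Map<String, Object> out = new LinkedHashMap<>(row);
        for (String c : columns) {
            out.remove(c);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("a", 1);
        row.put("b", 2);
        row.put("c", 3);

        // Analogue of the discussed table.drop("b"): one column named,
        // instead of selecting every remaining column explicitly.
        System.out.println(drop(row, "b").keySet()); // prints: [a, c]
    }
}
```

With 100 columns, dropping one names a single column, whereas the select-based workaround enumerates the remaining 99 — the same ease-of-use argument as for map and flatMap above.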
Hi,
An analysis of orthogonal functions would be great! There is certainly some overlap in the functions provided by the DataSet API.

In the past, I found that having low-level functions helped a lot to efficiently implement complex logic. Without partitionByHash, sortPartition, sort, mapPartition, combine, etc. it would not be possible to (efficiently) implement certain operators for the Table API, SQL, or the Cascading-On-Flink port that I did a while back. I could imagine that these APIs would be useful to implement DSLs on top of the Table API, such as Gelly.

Anyway, I completely agree that these physical operators should not be the first step. If we find that these methods are not needed, even better!

Let's try to keep this thread focused on the general proposal of extending the scope of the Table API, and keep the discussion of the concrete proposal that Xiaowei shared in the other thread (and the design doc). That will help to keep all related comments in one place ;-)

Best, Fabian

Am Di., 6. Nov. 2018 um 13:01 Uhr schrieb jincheng sun <[hidden email]>:

> Hi Fabian,
> Thank you for your deep thoughts on this; I think most of the questions you mentioned are well worth in-depth discussion! I want to share my thoughts on the following questions:
>
> 1. Do we need to move all DataSet API functionality into the Table API?
> I think most of the DataSet functionality should be added to the Table API, such as map, flatMap, groupReduce, etc., because these are very easy for users to use.
>
> 2. Do we support explicit physical operations like partitioning, sorting, or optimizer hints?
> I think we do not want to add the physical operations, e.g. sortPartition, partitionCustom, etc. From my point of view, those physical operations are used for optimization, which can be solved by hints (I think we should add a hints feature to both the Table API and SQL).
>
> 3. Do we want to support retractions in iteration?
> I think supporting iteration is a very complicated feature. I am not sure, but I think the implementation of iteration may follow the current batch mode, with retraction temporarily not supported, assuming that the trained data will not be updated within the current iteration. The updated data will be used in the next iteration. So I think we should have an in-depth discussion in a new thread.
>
> BTW, I find that you have left very useful comments in the Google doc:
>
> https://docs.google.com/document/d/1tnpxg31EQz2-MEzSotwFzqatsB4rNLz0I-l_vPa5H4Q/edit#
>
> Thanks again for both your mail feedback and doc comments!
>
> Best,
> Jincheng
>
> Fabian Hueske <[hidden email]> 于2018年11月6日周二 下午6:21写道:
>
> > Thanks for the replies Xiaowei and others!
> >
> > You are right, I did not consider the batch optimization that would be missing if the DataSet API were ported to extend the DataStream API. By extending the scope of the Table API, we can gain holistic logical & physical optimization, which would be great!
> > Is your plan to move all DataSet API functionality into the Table API? If so, do you envision any batch-related API in DataStream at all, or should this be done by converting a batch table to a DataStream? I'm asking because if there were batch features in DataStream, we would need some optimization there as well.
> >
> > I think the proposed separation of Table API (stateless APIs) and DataStream (APIs that expose state handling) is a good idea. On a side note, the DataSet API discouraged state handling in user functions, so porting this to the Table API would be quite "natural".
> >
> > As I said before, I like that we can incrementally extend the Table API. Map and FlatMap functions do not seem too difficult. Reduce, GroupReduce, Combine, GroupCombine, MapPartition might be more tricky, esp. if we want to support retractions.
> > Iterations should be a challenge. I assume that Calcite does not support iterations, so we probably need to split query / program and optimize the parts separately (IIRC, this is also how Flink's own optimizer handles this).
> > To what extent are you planning to support explicit physical operations like partitioning, sorting, or optimizer hints?
> >
> > I haven't had a look at the design document that you shared yet. Probably, I will find answers to some of my questions there ;-)
> >
> > Regarding the question of SQL or Table API, I agree that extending the scope of the Table API does not limit the scope of SQL. By adding more operations to the Table API, we can expand it to use cases that are not well served by SQL. As others have said, we'll of course continue to extend and improve Flink's SQL support (within the bounds of the standard).
> >
> > Best, Fabian
> >
> > Am Di., 6. Nov. 2018 um 10:09 Uhr schrieb jincheng sun <[hidden email]>:
> >
> > > Hi Jark,
> > > Glad to see your feedback!
> > > That's correct, the proposal is aiming to extend the functionality of the Table API! I like adding "drop" to fit the use case you mentioned. Not only that: with a 100-column Table whose UDF needs all 100 columns, we don't want to define the eval as eval(column0...column99); we prefer to define eval as eval(Row), using it like this: table.select(udf(*)). We also need to consider whether we package the columns as a row. In a scenario like this, we have classified it as a column operation, and we list the changes to the column operation after the map/flatMap/agg/flatAgg phase is completed. Currently, Xiaowei has started a thread outlining what we are proposing.
> > > Please see the details in the mail thread:
> > >
> > > https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB
> > >
> > > At this stage, the Table API Enhancement Outline is as follows:
> > >
> > > https://docs.google.com/document/d/1tnpxg31EQz2-MEzSotwFzqatsB4rNLz0I-l_vPa5H4Q/edit?usp=sharing
> > >
> > > Please let us know if you have further thoughts or feedback!
> > >
> > > Thanks,
> > > Jincheng
> > >
> > > Jark Wu <[hidden email]> 于2018年11月6日周二 下午3:35写道:
> > >
> > > > Hi Jincheng,
> > > >
> > > > Thanks for your proposal. I think it is a helpful enhancement for the Table API and a solid step forward for it. It doesn't weaken SQL or DataStream, because the conversion between DataStream and Table still works. People with advanced cases (e.g. complex and fine-grained state control) can go with DataStream, but most general cases can stay in the Table API. This work aims to extend the functionality of the Table API, to extend its usage scenarios, and to help the Table API become a more widely used API.
> > > >
> > > > For example, someone wants to drop one column from a 100-column Table. Currently, we have to convert the Table to a DataStream and use a MapFunction to do that, or select the remaining 99 columns using the Table.select API. But if we supported a Table.drop() method in the Table API, it would be a very convenient method and would let users stay in Table.
> > > >
> > > > Looking forward to the more detailed design and further discussion.
> > > > Regards,
> > > > Jark
> > > >
> > > > jincheng sun <[hidden email]> 于2018年11月6日周二 下午1:05写道:
> > > > > [...]
> > > > >
> > > > > Am Mo., 5. Nov. 2018 um 02:49 Uhr schrieb Rong Rong <[hidden email]>:
> > > > >
> > > > > > Hi Jincheng,
> > > > > >
> > > > > > Thank you for the proposal! I think being able to define a process / co-process function in the Table API definitely opens up a whole new level of applications using a unified API.
> > > > > >
> > > > > > In addition, as Tzu-Li and Hequn have mentioned, the optimization layer of the Table API will already bring additional benefit over directly programming on top of the DataStream/DataSet API. I am very interested in and looking forward to seeing the support for more complex use cases, especially iterations. It will enable the Table API to define much broader, event-driven use cases such as real-time ML prediction/training.
> > > > > >
> > > > > > As Timo mentioned, this will make the Table API diverge from the SQL API. But from my experience, the Table API always gave me the impression of being a more sophisticated, syntax-aware way to express relational operations.
> > > > > > Looking forward to further discussion and collaborations on the FLIP doc.
> > > > > > --
> > > > > > Rong
> > > > > >
> > > > > > On Sun, Nov 4, 2018 at 5:22 PM jincheng sun <[hidden email]> wrote:
> > > > > >
> > > > > > > Hi tison,
> > > > > > >
> > > > > > > Thanks a lot for your feedback! I am very happy to see that community contributors agree to enhance the Table API. This work is a long-term continuous effort; we will push it in stages, and we will soon complete the enhancement list of the first phase. We can go into deep discussion in the Google doc. Thanks again for joining this very important discussion of the Flink Table API.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Jincheng
> > > > > > >
> > > > > > > Tzu-Li Chen <[hidden email]> 于2018年11月2日周五 下午1:49写道:
> > > > > > >
> > > > > > > > Hi Jincheng,
> > > > > > > >
> > > > > > > > Thanks a lot for your proposal! I find it a good starting point for internal optimization work and for helping Flink become more user-friendly.
> > > > > > > >
> > > > > > > > AFAIK, DataStream is currently the most popular API, with which Flink users must describe their logic in detail. From a more internal view, the conversion from DataStream to JobGraph is quite mechanical and hard to optimize. So when users program with DataStream, they have to learn more internals and spend a lot of time tuning for performance. With your proposal, we provide enhanced functionality of the Table API, so that users can describe their jobs easily at the Table level.
> > > > > > > > This gives Flink developers an opportunity to introduce an optimization phase while transforming the user program (described by the Table API) into the internal representation.
> > > > > > > >
> > > > > > > > Given a user who wants to start using Flink with simple ETL, pipelining, or analytics, he would find it most naturally described by the SQL/Table API. Further, as mentioned by @hequn,
> > > > > > > >
> > > > > > > > > SQL is a widely used language. It follows standards, is a descriptive language, and is easy to use
> > > > > > > >
> > > > > > > > thus we can expect that, with the enhancement of the SQL/Table API, Flink becomes more friendly to users.
> > > > > > > >
> > > > > > > > Looking forward to the design doc/FLIP!
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > tison.
> > > > > > > >
> > > > > > > > jincheng sun <[hidden email]> 于2018年11月2日周五 上午11:46写道:
> > > > > > > >
> > > > > > > > > Hi Hequn,
> > > > > > > > > Thanks for your feedback! And also thanks for our offline discussion! You are right, unification of batch and streaming is very important for the Flink API. We will provide a more detailed design later. Please let me know if you have further thoughts or feedback.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Jincheng
> > > > > > > > >
> > > > > > > > > Hequn Cheng <[hidden email]> 于2018年11月2日周五 上午10:02写道:
> > > > > > > > >
> > > > > > > > > > Hi Jincheng,
> > > > > > > > > >
> > > > > > > > > > Thanks a lot for your proposal. It is very encouraging!
> > > > > > > > > > As we all know, SQL is a widely used language. It follows standards, is a descriptive language, and is easy to use. A powerful feature of SQL is that it supports optimization: users only need to care about the logic of the program, and the underlying optimizer helps optimize its performance. However, in terms of functionality and ease of use, SQL is limited in some scenarios, as described in Jincheng's proposal.
> > > > > > > > > >
> > > > > > > > > > Correspondingly, the DataStream/DataSet API provides powerful functionality. Users can write a ProcessFunction/CoProcessFunction and get access to timers. Compared with SQL, it provides more functionality and flexibility. However, it does not support optimization like SQL does. Meanwhile, the DataStream and DataSet APIs have not been unified, which means that, for the same logic, users need to write one job for streaming and one for batch.
> > > > > > > > > >
> > > > > > > > > > With the Table API, I think we can combine the advantages of both: users can easily write relational operations and enjoy optimization, while at the same time it supports more functionality and ease of use.
> > > > > > > > > > Looking forward to the detailed design/FLIP.
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Hequn
> > > > > > > > > >
> > > > > > > > > > On Fri, Nov 2, 2018 at 9:48 AM Shaoxuan Wang <[hidden email]> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Aljoscha,
> > > > > > > > > > > Glad that you like the proposal. We have completed the prototype of most of the newly proposed functionality. Once we collect the feedback from the community, we will come up with a concrete FLIP/design doc.
> > > > > > > > > > >
> > > > > > > > > > > Regards,
> > > > > > > > > > > Shaoxuan
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Nov 1, 2018 at 8:12 PM Aljoscha Krettek <[hidden email]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Jincheng,
> > > > > > > > > > > >
> > > > > > > > > > > > these points sound very good! Are there any concrete proposals for changes? For example a FLIP/design document?
> > > > > > > > > > > >
> > > > > > > > > > > > See here for FLIPs:
> > > > > > > > > > > >
> > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
> > > > > > > > > > > >
> > > > > > > > > > > > Best,
> > > > > > > > > > > > Aljoscha
> > > > > > > > > > > >
> > > > > > > > > > > > > On 1.
Nov 2018, at 12:51, jincheng sun <[hidden email]> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > *--------I am sorry for the formatting of the content. I reformat the content as follows-----------*
> > > > > > > > > > > > >
> > > > > > > > > > > > > *Hi ALL,*
> > > > > > > > > > > > >
> > > > > > > > > > > > > With the continuous efforts of the community, the Flink system has been continuously improved, which has attracted more and more users. Flink SQL is a canonical, widely used relational query language. However, there are still some scenarios where Flink SQL fails to meet user needs in terms of functionality and ease of use, such as:
> > > > > > > > > > > > >
> > > > > > > > > > > > > *1. In terms of functionality*
> > > > > > > > > > > > > Iteration, user-defined window, user-defined join, user-defined GroupReduce, etc. Users cannot express them with SQL;
> > > > > > > > > > > > >
> > > > > > > > > > > > > *2. In terms of ease of use*
> > > > > > > > > > > > > - Map - e.g. “dataStream.map(mapFun)”. Although “table.select(udf1(), udf2(), udf3()....)” can be used to accomplish the same function, with a map() function returning 100 columns, one has to define or call 100 UDFs when using SQL, which is quite involved.
> > > > > > > > > > > > > - FlatMap - e.g. “dataStream.flatMap(flatMapFun)”. Similarly, it can be implemented with “table.join(udtf).select()”. However, it is obvious that DataStream is easier to use than SQL.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Due to the above two reasons, some users have to use the DataStream API or the DataSet API. But when they do that, they lose the unification of batch and streaming. They will also lose the sophisticated optimizations such as codegen, aggregate join transpose, and multi-stage agg from Flink SQL.
> > > > > > > > > > > > >
> > > > > > > > > > > > > We believe that enhancing the functionality and productivity is vital for the successful adoption of the Table API. To this end, the Table API still requires more efforts from every contributor in the community. We see great opportunity in improving our users’ experience through this work. Any feedback is welcome.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Regards,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Jincheng
> > > > > > > > > > > > >
> > > > > > > > > > > > > jincheng sun <[hidden email]> 于2018年11月1日周四 下午5:07写道:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > [...]
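Jark's Table.drop() example in the quoted thread can be sketched in a few lines. This is not Flink's actual API (at the time of this thread, drop was only a proposal); it is a self-contained toy schema model (`SchemaTable` is an invented name) showing that drop(col) is sugar for selecting the remaining columns — exactly the 99-name boilerplate users would otherwise write by hand.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy schema-only table (NOT a Flink class) to illustrate Jark's point.
class SchemaTable {
    final List<String> columns;
    SchemaTable(List<String> columns) { this.columns = columns; }

    // What users must write today: enumerate every column they want to keep.
    SchemaTable select(String... keep) {
        return new SchemaTable(Arrays.asList(keep));
    }

    // The proposed convenience: name only what should go away.
    SchemaTable drop(String... remove) {
        List<String> keep = new ArrayList<>(columns);
        keep.removeAll(Arrays.asList(remove));
        return new SchemaTable(keep);
    }
}
```

For a 100-column table, `t.drop("f42")` replaces a `select(...)` call listing 99 column names, while keeping the user entirely within the Table abstraction.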
Hi,
What is our intended division/border between the Table API and DataSet or DataStream? If we want the Table API to drift away from SQL, that would be a valid question.

> Another distinguishing feature of DataStream API is that users get direct
> access to state/statebackend which we intentionally avoided in Table API

Do we really want to make the Table API an equivalent of the DataSet/DataStream API, but without state? Drawing the boundary that way would make it more difficult for users to pick the right tool if, for many use cases, they could use both. What if at some point they come to the conclusion that they need a little state access for something?

If that's our intended end-goal separation between the Table API and the DataStream API, it would be very, very weird to have two very similar APIs that have tons of small differences but are basically equivalent modulo state access. Maybe instead of duplicating effort across our different APIs, we should focus more on either interoperability or unifying them?

For example, if we would like to differentiate these APIs because of the presence/lack of an optimizer, maybe the APIs should be the same, but there should be a way to tell whether a UDF/operator is deterministic, has side effects, etc. And if such an operator is found in the plan, the nodes below and above could still be subject to regular optimization rules.

Piotrek

> On 6 Nov 2018, at 14:04, Fabian Hueske <[hidden email]> wrote:
>
> Hi,
>
> An analysis of orthogonal functions would be great!
> There is certainly some overlap in the functions provided by the DataSet API.
>
> In the past, I found that having low-level functions helped a lot to efficiently implement complex logic.
> Without partitionByHash, sortPartition, sort, mapPartition, combine, etc. it would not be possible to (efficiently) implement certain operators for the Table API, SQL, or the Cascading-On-Flink port that I did a while back.
> [...]
>>>> >>>> Thanks, >>>> Jincheng >>>> >>>> >>>> Jark Wu <[hidden email]> 于2018年11月6日周二 下午3:35写道: >>>> >>>>> Hi jingcheng, >>>>> >>>>> Thanks for your proposal. I think it is a helpful enhancement for >>>> TableAPI >>>>> which is a solid step forward for TableAPI. >>>>> It doesn't weaken SQL or DataStream, because the conversion between >>>>> DataStream and Table still works. >>>>> People with advanced cases (e.g. complex and fine-grained state >>> control) >>>>> can go with DataStream, >>>>> but most general cases can stay in TableAPI. This works is aiming to >>>> extend >>>>> the functionality for TableAPI, >>>>> to extend the usage scenario, to help TableAPI becomes a more widely >>> used >>>>> API. >>>>> >>>>> For example, someone want to drop one column from a 100-columns >> Table. >>>>> Currently, we have to convert >>>>> Table to DataStream and use MapFunction to do that, or select the >>>> remaining >>>>> 99 columns using Table.select API. >>>>> But if we support Table.drop() method for TableAPI, it will be a very >>>>> convenient method and let users stay in Table. >>>>> >>>>> Looking forward to the more detailed design and further discussion. >>>>> >>>>> Regards, >>>>> Jark >>>>> >>>>> jincheng sun <[hidden email]> 于2018年11月6日周二 下午1:05写道: >>>>> >>>>>> Hi Rong Rong, >>>>>> >>>>>> Sorry for the late reply, And thanks for your feedback! We will >>>> continue >>>>>> to add more convenience features to the TableAPI, such as map, >>> flatmap, >>>>>> agg, flatagg, iteration etc. And I am very happy that you are >>>> interested >>>>> on >>>>>> this proposal. Due to this is a long-term continuous work, we will >>> push >>>>> it >>>>>> in stages. Currently Xiaowei has started a threading outlining >>> which >>>>> talk >>>>>> about what we are proposing. 
Please see the detail in the mail >>> thread: >>>>>> >>>>>> >>>>> >>>> >>> >> https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB >>>>>> < >>>>>> >>>>> >>>> >>> >> https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB >>>>>>> >>>>>> . >>>>>> >>>>>> The Table API Enhancement Outline as follows: >>>>>> >>>>>> >>>>> >>>> >>> >> https://docs.google.com/document/d/1tnpxg31EQz2-MEzSotwFzqatsB4rNLz0I-l_vPa5H4Q/edit?usp=sharing >>>>>> >>>>>> Please let we know if you have further thoughts or feedback! >>>>>> >>>>>> Thanks, >>>>>> Jincheng >>>>>> >>>>>> Fabian Hueske <[hidden email]> 于2018年11月5日周一 下午7:03写道: >>>>>> >>>>>>> Hi Jincheng, >>>>>>> >>>>>>> Thanks for this interesting proposal. >>>>>>> I like that we can push this effort forward in a very >> fine-grained >>>>>> manner, >>>>>>> i.e., incrementally adding more APIs to the Table API. >>>>>>> >>>>>>> However, I also have a few questions / concerns. >>>>>>> Today, the Table API is tightly integrated with the DataSet and >>>>>> DataStream >>>>>>> APIs. It is very easy to convert a Table into a DataSet or >>> DataStream >>>>> and >>>>>>> vice versa. This mean it is already easy to combine custom logic >> an >>>>>>> relational operations. What I like is that several aspects are >>>> clearly >>>>>>> separated like retraction and timestamp handling (see below) + >> all >>>>>>> libraries on DataStream/DataSet can be easily combined with >>>> relational >>>>>>> operations. >>>>>>> I can see that adding more functionality to the Table API would >>>> remove >>>>>> the >>>>>>> distinction between DataSet and DataStream. However, wouldn't we >>> get >>>> a >>>>>>> similar benefit by extending the DataStream API for proper >> support >>>> for >>>>>>> bounded streams (as is the long-term goal of Flink)? >>>>>>> I'm also a bit skeptical about the optimization opportunities we >>>> would >>>>>>> gain. 
Map/FlatMap UDFs are black boxes that cannot be easily >>> removed >>>>>>> without additional information (I did some research on this a few >>>> years >>>>>> ago >>>>>>> [1]). >>>>>>> >>>>>>> Moreover, I think there are a few tricky details that need to be >>>>> resolved >>>>>>> to enable a good integration. >>>>>>> >>>>>>> 1) How to deal with retraction messages? The DataStream API does >>> not >>>>>> have a >>>>>>> notion of retractions. How would a MapFunction or FlatMapFunction >>>>> handle >>>>>>> retraction? Do they need to be aware of the change flag? Custom >>>>> windowing >>>>>>> and aggregation logic would certainly need to have that >>> information. >>>>>>> 2) How to deal with timestamps? The DataStream API does not give >>>> access >>>>>> to >>>>>>> timestamps. In the Table API / SQL these are exposed as regular >>>>>> attributes. >>>>>>> How can we ensure that timestamp attributes remain valid (i.e. >>>> aligned >>>>>> with >>>>>>> watermarks) if the output is produced by arbitrary code? >>>>>>> There might be more issues of this kind. >>>>>>> >>>>>>> My main question would be how much would we gain with this >> proposal >>>>> over >>>>>> a >>>>>>> tight integration of Table API and DataStream API, assuming that >>>> batch >>>>>>> functionality is moved to DataStream? >>>>>>> >>>>>>> Best, Fabian >>>>>>> >>>>>>> [1] >>> http://stratosphere.eu/assets/papers/openingTheBlackBoxes_12.pdf >>>>>>> >>>>>>> >>>>>>> Am Mo., 5. Nov. 2018 um 02:49 Uhr schrieb Rong Rong < >>>>> [hidden email] >>>>>>> : >>>>>>> >>>>>>>> Hi Jincheng, >>>>>>>> >>>>>>>> Thank you for the proposal! I think being able to define a >>> process >>>> / >>>>>>>> co-process function in table API definitely opens up a whole >> new >>>>> level >>>>>> of >>>>>>>> applications using a unified API. 
>>>>>>>> >>>>>>>> In addition, as Tzu-Li and Hequn have mentioned, the benefit of >>>>>>>> optimization layer of Table API will already bring in >> additional >>>>>> benefit >>>>>>>> over directly programming on top of DataStream/DataSet API. I >> am >>>> very >>>>>>>> interested an looking forward to seeing the support for more >>>> complex >>>>>> use >>>>>>>> cases, especially iterations. It will enable table API to >> define >>>> much >>>>>>>> broader, event-driven use cases such as real-time ML >>>>>> prediction/training. >>>>>>>> >>>>>>>> As Timo mentioned, This will make Table API diverge from the >> SQL >>>> API. >>>>>> But >>>>>>>> as from my experience Table API was always giving me the >>> impression >>>>> to >>>>>>> be a >>>>>>>> more sophisticated, syntactic-aware way to express relational >>>>>> operations. >>>>>>>> Looking forward to further discussion and collaborations on the >>>> FLIP >>>>>> doc. >>>>>>>> >>>>>>>> -- >>>>>>>> Rong >>>>>>>> >>>>>>>> On Sun, Nov 4, 2018 at 5:22 PM jincheng sun < >>>>> [hidden email]> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Hi tison, >>>>>>>>> >>>>>>>>> Thanks a lot for your feedback! >>>>>>>>> I am very happy to see that community contributors agree to >>>>> enhanced >>>>>>> the >>>>>>>>> TableAPI. This work is a long-term continuous work, we will >>> push >>>> it >>>>>> in >>>>>>>>> stages, we will soon complete the enhanced list of the first >>>>> phase, >>>>>> we >>>>>>>> can >>>>>>>>> go deep discussion in google doc. thanks again for joining >> on >>>> the >>>>>> very >>>>>>>>> important discussion of the Flink Table API. >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> Jincheng >>>>>>>>> >>>>>>>>> Tzu-Li Chen <[hidden email]> 于2018年11月2日周五 下午1:49写道: >>>>>>>>> >>>>>>>>>> Hi jingchengm >>>>>>>>>> >>>>>>>>>> Thanks a lot for your proposal! I find it is a good start >>> point >>>>> for >>>>>>>>>> internal optimization works and help Flink to be more >>>>>>>>>> user-friendly. 
>>>>>>>>>> >>>>>>>>>> AFAIK, DataStream is the most popular API currently that >>> Flink >>>>>>>>>> users should describe their logic with detailed logic. >>>>>>>>>> From a more internal view the conversion from DataStream to >>>>>>>>>> JobGraph is quite mechanically and hard to be optimized. So >>>> when >>>>>>>>>> users program with DataStream, they have to learn more >>>> internals >>>>>>>>>> and spend a lot of time to tune for performance. >>>>>>>>>> With your proposal, we provide enhanced functionality of >>> Table >>>>> API, >>>>>>>>>> so that users can describe their job easily on Table >> aspect. >>>> This >>>>>>> gives >>>>>>>>>> an opportunity to Flink developers to introduce an optimize >>>> phase >>>>>>>>>> while transforming user program(described by Table API) to >>>>> internal >>>>>>>>>> representation. >>>>>>>>>> >>>>>>>>>> Given a user who want to start using Flink with simple ETL, >>>>>>> pipelining >>>>>>>>>> or analytics, he would find it is most naturally described >> by >>>>>>> SQL/Table >>>>>>>>>> API. Further, as mentioned by @hequn, >>>>>>>>>> >>>>>>>>>> SQL is a widely used language. It follows standards, is a >>>>>>>>>>> descriptive language, and is easy to use >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> thus we could expect with the enhancement of SQL/Table API, >>>> Flink >>>>>>>>>> becomes more friendly to users. >>>>>>>>>> >>>>>>>>>> Looking forward to the design doc/FLIP! >>>>>>>>>> >>>>>>>>>> Best, >>>>>>>>>> tison. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> jincheng sun <[hidden email]> 于2018年11月2日周五 >>>> 上午11:46写道: >>>>>>>>>> >>>>>>>>>>> Hi Hequn, >>>>>>>>>>> Thanks for your feedback! And also thanks for our offline >>>>>>> discussion! >>>>>>>>>>> You are right, unification of batch and streaming is very >>>>>> important >>>>>>>> for >>>>>>>>>>> flink API. >>>>>>>>>>> We will provide more detailed design later, Please let me >>>> know >>>>> if >>>>>>> you >>>>>>>>>> have >>>>>>>>>>> further thoughts or feedback. 
>>>>>>>>>>> >>>>>>>>>>> Thanks, >>>>>>>>>>> Jincheng >>>>>>>>>>> >>>>>>>>>>> Hequn Cheng <[hidden email]> 于2018年11月2日周五 >>> 上午10:02写道: >>>>>>>>>>> >>>>>>>>>>>> Hi Jincheng, >>>>>>>>>>>> >>>>>>>>>>>> Thanks a lot for your proposal. It is very encouraging! >>>>>>>>>>>> >>>>>>>>>>>> As we all know, SQL is a widely used language. It >> follows >>>>>>>> standards, >>>>>>>>>> is a >>>>>>>>>>>> descriptive language, and is easy to use. A powerful >>>> feature >>>>> of >>>>>>> SQL >>>>>>>>> is >>>>>>>>>>> that >>>>>>>>>>>> it supports optimization. Users only need to care about >>> the >>>>>> logic >>>>>>>> of >>>>>>>>>> the >>>>>>>>>>>> program. The underlying optimizer will help users >>> optimize >>>>> the >>>>>>>>>>> performance >>>>>>>>>>>> of the program. However, in terms of functionality and >>> ease >>>>> of >>>>>>> use, >>>>>>>>> in >>>>>>>>>>> some >>>>>>>>>>>> scenarios sql will be limited, as described in >> Jincheng's >>>>>>> proposal. >>>>>>>>>>>> >>>>>>>>>>>> Correspondingly, the DataStream/DataSet api can provide >>>>>> powerful >>>>>>>>>>>> functionalities. Users can write >>>>>>> ProcessFunction/CoProcessFunction >>>>>>>>> and >>>>>>>>>>> get >>>>>>>>>>>> the timer. Compared with SQL, it provides more >>>>> functionalities >>>>>>> and >>>>>>>>>>>> flexibilities. However, it does not support >> optimization >>>> like >>>>>>> SQL. >>>>>>>>>>>> Meanwhile, DataStream/DataSet api has not been unified >>>> which >>>>>>> means, >>>>>>>>> for >>>>>>>>>>> the >>>>>>>>>>>> same logic, users need to write a job for each stream >> and >>>>>> batch. >>>>>>>>>>>> >>>>>>>>>>>> With TableApi, I think we can combine the advantages of >>>> both. >>>>>>> Users >>>>>>>>> can >>>>>>>>>>>> easily write relational operations and enjoy >>> optimization. >>>> At >>>>>> the >>>>>>>>> same >>>>>>>>>>>> time, it supports more functionality and ease of use. >>>> Looking >>>>>>>> forward >>>>>>>>>> to >>>>>>>>>>>> the detailed design/FLIP. 
>>>>>>>>>>>> >>>>>>>>>>>> Best, >>>>>>>>>>>> Hequn >>>>>>>>>>>> >>>>>>>>>>>> On Fri, Nov 2, 2018 at 9:48 AM Shaoxuan Wang < >>>>>>> [hidden email]> >>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hi Aljoscha, >>>>>>>>>>>>> Glad that you like the proposal. We have completed >> the >>>>>>> prototype >>>>>>>> of >>>>>>>>>>> most >>>>>>>>>>>>> new proposed functionalities. Once collect the >> feedback >>>>> from >>>>>>>>>> community, >>>>>>>>>>>> we >>>>>>>>>>>>> will come up with a concrete FLIP/design doc. >>>>>>>>>>>>> >>>>>>>>>>>>> Regards, >>>>>>>>>>>>> Shaoxuan >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Thu, Nov 1, 2018 at 8:12 PM Aljoscha Krettek < >>>>>>>>> [hidden email] >>>>>>>>>>> >>>>>>>>>>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Hi Jincheng, >>>>>>>>>>>>>> >>>>>>>>>>>>>> these points sound very good! Are there any >> concrete >>>>>>> proposals >>>>>>>>> for >>>>>>>>>>>>>> changes? For example a FLIP/design document? >>>>>>>>>>>>>> >>>>>>>>>>>>>> See here for FLIPs: >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals >>>>>>>>>>>>>> >>>>>>>>>>>>>> Best, >>>>>>>>>>>>>> Aljoscha >>>>>>>>>>>>>> >>>>>>>>>>>>>>> On 1. Nov 2018, at 12:51, jincheng sun < >>>>>>>>> [hidden email] >>>>>>>>>>> >>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> *--------I am sorry for the formatting of the >>>>>>> content. >>>>>>>> I >>>>>>>>>>>> reformat >>>>>>>>>>>>>>> the **content** as follows-----------* >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> *Hi ALL,* >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> With the continuous efforts from the community, >> the >>>>> Flink >>>>>>>>> system >>>>>>>>>>> has >>>>>>>>>>>>> been >>>>>>>>>>>>>>> continuously improved, which has attracted more >> and >>>>> more >>>>>>>> users. >>>>>>>>>>> Flink >>>>>>>>>>>>> SQL >>>>>>>>>>>>>>> is a canonical, widely used relational query >>>> language. 
>>>>>>>> However, >>>>>>>>>>> there >>>>>>>>>>>>> are >>>>>>>>>>>>>>> still some scenarios where Flink SQL failed to >> meet >>>>> user >>>>>>>> needs >>>>>>>>> in >>>>>>>>>>>> terms >>>>>>>>>>>>>> of >>>>>>>>>>>>>>> functionality and ease of use, such as: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> *1. In terms of functionality* >>>>>>>>>>>>>>> Iteration, user-defined window, user-defined >>> join, >>>>>>>>>> user-defined >>>>>>>>>>>>>>> GroupReduce, etc. Users cannot express them with >>> SQL; >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> *2. In terms of ease of use* >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - Map - e.g. “dataStream.map(mapFun)”. Although >>>>>>>>>>>> “table.select(udf1(), >>>>>>>>>>>>>>> udf2(), udf3()....)” can be used to accomplish >>> the >>>>> same >>>>>>>>>>> function., >>>>>>>>>>>>>> with a >>>>>>>>>>>>>>> map() function returning 100 columns, one has >> to >>>>> define >>>>>>> or >>>>>>>>> call >>>>>>>>>>> 100 >>>>>>>>>>>>>> UDFs >>>>>>>>>>>>>>> when using SQL, which is quite involved. >>>>>>>>>>>>>>> - FlatMap - e.g. >>> “dataStrem.flatmap(flatMapFun)”. >>>>>>>> Similarly, >>>>>>>>>> it >>>>>>>>>>>> can >>>>>>>>>>>>> be >>>>>>>>>>>>>>> implemented with “table.join(udtf).select()”. >>>>> However, >>>>>> it >>>>>>>> is >>>>>>>>>>>> obvious >>>>>>>>>>>>>> that >>>>>>>>>>>>>>> dataStream is easier to use than SQL. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Due to the above two reasons, some users have to >>> use >>>>> the >>>>>>>>>> DataStream >>>>>>>>>>>> API >>>>>>>>>>>>>> or >>>>>>>>>>>>>>> the DataSet API. But when they do that, they lose >>> the >>>>>>>>> unification >>>>>>>>>>> of >>>>>>>>>>>>>> batch >>>>>>>>>>>>>>> and streaming. They will also lose the >>> sophisticated >>>>>>>>>> optimizations >>>>>>>>>>>> such >>>>>>>>>>>>>> as >>>>>>>>>>>>>>> codegen, aggregate join transpose and multi-stage >>> agg >>>>>> from >>>>>>>>> Flink >>>>>>>>>>> SQL. 
>>>>>>>>>>>>>>> We believe that enhancing the functionality and productivity is vital for
>>>>>>>>>>>>>>> the successful adoption of Table API. To this end, Table API still
>>>>>>>>>>>>>>> requires more efforts from every contributor in the community. We see great
>>>>>>>>>>>>>>> opportunity in improving our user’s experience from this work. Any feedback
>>>>>>>>>>>>>>> is welcome.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Jincheng
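The convenience argument made by Jark and Jincheng in the thread above (dropping one column from a 100-column table, or passing all columns to a UDF as a single Row) boils down to adding sugar over select(). A toy model, with a made-up Table class that is not Flink's actual API, shows the idea:

```python
# Toy model, not Flink's Table API: drop() is pure sugar over select(),
# but saves the user from listing the 99 surviving columns by hand.
class Table:
    def __init__(self, columns):
        self.columns = list(columns)

    def select(self, *cols):
        unknown = [c for c in cols if c not in self.columns]
        if unknown:
            raise KeyError(f"unknown columns: {unknown}")
        return Table(cols)

    def drop(self, *cols):
        # Equivalent to explicitly selecting every remaining column.
        return self.select(*[c for c in self.columns if c not in cols])

wide = Table([f"c{i}" for i in range(100)])
assert len(wide.drop("c42").columns) == 99   # one call instead of 99 names
```

The same table-valued sugar could express the eval(Row) case: one UDF parameter standing for the whole schema instead of 100 positional arguments.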
Hi all,
Thank you for your replies and comments. I have similar considerations to Piotrek's. My opinion is that two APIs are enough for Flink: a declarative one (SQL) and an imperative one (DataStream). From my perspective, most users prefer SQL most of the time and turn to DataStream when the logic is complex. The Table API, which falls somewhere between SQL and DataStream and thus obtains some benefits from both sides, is currently never the choice, because it requires some programming effort in ad-hoc scenarios and lacks expressiveness for complex logic.

I agree that more optimization opportunities can be gained from the Table API than from DataStream, but I think the benefits are marginal. The obstacle is the side effects in imperative languages. Symbolic programming can help to some degree, but as it is still declarative, it does not deal easily with side effects. SCOPE does some good work on the optimization opportunities gained from program analysis of UDFs [1, 2], which may help in understanding the problem.

I also agree that the enhancement of the Table API does not mean the weakening of SQL. But given the limited resources in the community, I think we should focus our work on SQL instead of the Table API at present. Many users are looking forward to Flink SQL providing good performance for both streaming and batch processing. I think that demand is more urgent and should be put first.

[1] Spotting Code Optimizations in Data-Parallel Pipelines through PeriSCOPE. In OSDI 2012.
[2] Optimizing Data Shuffling in Data-Parallel Computation by Understanding User-Defined Functions. In NSDI 2012.

Regards,
Xiaogang

Piotr Nowojski <[hidden email]> 于2018年11月6日周二 下午9:24写道:

> Hi,
>
> What is our intended division/border between Table API and DataSet or
> DataStream? If we want Table API to drift away from SQL that would be a
> valid question.
> > > Another distinguishing feature of DataStream API is that users get direct > > access to state/statebackend which we intensionally avoided in Table API > > Do we really want to make Table API an equivalent of DataSet/DataStream > API but without a state? Drawing boundary in such way would make it more > difficult for users to pick the right tool if for many use cases they could > use both. What if then at some point of time they came to conclusion that > they need a small state access for something? If that’s our intended end > goal separation between Table API and DataStream API, it would be very very > weird having two very similar APIs, that have tons of small differences, > but are basically equivalent modulo state accesses. > > Maybe instead of duplicating efforts and work between our different APIs > we should more focus on either interoperability or unifying them? > > For example if we would like to differentiate those APIs because of the > presence/lack of optimiser, maybe the APIs should be the same, but there > should be a way tell whether the UDF/operator is deterministic, has side > effects, etc. And if such operator is found in the plan, the nodes below > and above could still be subject to regular optimisation rules. > > Piotrek > > > On 6 Nov 2018, at 14:04, Fabian Hueske <[hidden email]> wrote: > > > > Hi, > > > > An analysis of orthogonal functions would be great! > > There is certainly some overlap in the functions provided by the DataSet > > API. > > > > In the past, I found that having low-level functions helped a lot to > > efficiently implement complex logic. > > Without partitionByHash, sortPartition, sort, mapPartition, combine, etc > it > > would not be possible to (efficiently) implement certain operators for > the > > Table API, SQL or the Cascading-On-Flink port that I did a while back. > > I could imaging that these APIs would be useful to implement DSLs on top > of > > the Table API, such as Gelly. 
> > > > Anyway, I completely agree that these physical operators should not be > the > > first step. > > If we find that these methods are not needed, even better! > > > > Let's try to keep this thread focused on the general proposal of > extending > > the scope of the Table API and keep the discussion of concrete proposal > > that Xiaowei shared in the other thread (and the design doc). > > That will help to keep all related comments in one place ;-) > > > > Best, Fabian > > > > > > Am Di., 6. Nov. 2018 um 13:01 Uhr schrieb jincheng sun < > > [hidden email]>: > > > >> Hi Fabian, > >> Thank you for your deep thoughts in this regard, I think most of > questions > >> you had mentioned are very worthy of in-depth discussion! I want share > >> thoughts about following questions: > >> > >> 1. Do we need move all DataSet API functionality into the Table API? > >> I think most of dataset functionality should be add into the TableAPI, > such > >> as map, flatmap, groupReduce etc., Because these are very easy to use > for > >> the user. > >> > >> 2. Do we support explicit physical operations like partitioning, > sorting or > >> optimizer hints? > >> I think we do not want add the physical operations, e.g.: > >> sortPartition,partitionCustom etc. From the points of my view, those > >> physical operations are used to optimization, which can be solved by > >> hints(I think we should add hints feature to both tableAPI and SQL). > >> > >> 3. Do we want to support retractions in iteration? > >> I think support iteration is a very complicated function。I am not sure, > >> but i think the implementation of the iteration may be implemented > >> according to the current batch mode, and the retraction is temporarily > not > >> supported, assuming that the trained data will not be updated in the > >> current iteration. The updated data will be used in the next > iteration. So > >> I think we should in-depth discussion in a new threading. 
> >> > >> BTW, I find that you have had leave the very useful comments in the > >> doc: > >> > >> > https://docs.google.com/document/d/1tnpxg31EQz2-MEzSotwFzqatsB4rNLz0I-l_vPa5H4Q/edit# > >> > >> Thanks again for both your mail feedback and doc comments ! > >> > >> Best, > >> Jincheng > >> > >> > >> > >> Fabian Hueske <[hidden email]> 于2018年11月6日周二 下午6:21写道: > >> > >>> Thanks for the replies Xiaowei and others! > >>> > >>> You are right, I did not consider the batch optimization that would be > >>> missing if the DataSet API would be ported to extend the DataStream > API. > >>> By extending the scope of the Table API, we can gain a holistic > logical & > >>> physical optimization which would be great! > >>> Is your plan to move all DataSet API functionality into the Table API? > >>> If so, do you envision any batch-related API in DataStream at all or > >> should > >>> this be done by converting a batch table to DataStream? I'm asking > >> because > >>> if there would be batch features in DataStream, we would need some > >>> optimization there as well. > >>> > >>> I think the proposed separation of Table API (stateless APIs) and > >>> DataStream (APIs that expose state handling) is a good idea. > >>> On a side note, the DataSet API discouraged state handing in user > >> function, > >>> so porting this Table API would be quite "natural". > >>> > >>> As I said before, I like that we can incrementally extend the Table > API. > >>> Map and FlatMap functions do not seem too difficult. > >>> Reduce, GroupReduce, Combine, GroupCombine, MapPartition might be more > >>> tricky, esp. if we want to support retractions. > >>> Iterations should be a challenge. I assume that Calcite does not > support > >>> iterations, so we probably need to split query / program and optimize > >> parts > >>> separately (IIRC, this is also how Flink's own optimizer handles this). 
> >>> To what extend are you planning to support explicit physical operations > >>> like partitioning, sorting or optimizer hints? > >>> > >>> I haven't had a look in the design document that you shared. Probably, > I > >>> find answers to some of my questions there ;-) > >>> > >>> Regarding the question of SQL or Table API, I agree that extending the > >>> scope of the Table API does not limit the scope for SQL. > >>> By adding more operations to the Table API we can expand it to use case > >>> that are not well-served by SQL. > >>> As others have said, we'll of course continue to extend and improve > >> Flink's > >>> SQL support (within the bounds of the standard). > >>> > >>> Best, Fabian > >>> > >>> Am Di., 6. Nov. 2018 um 10:09 Uhr schrieb jincheng sun < > >>> [hidden email]>: > >>> > >>>> Hi Jark, > >>>> Glad to see your feedback! > >>>> That's Correct, The proposal is aiming to extend the functionality for > >>>> Table API! I like add "drop" to fit the use case you mentioned. Not > >> only > >>>> that, if a 100-columns Table. and our UDF needs these 100 columns, we > >>> don't > >>>> want to define the eval as eval(column0...column99), we prefer to > >> define > >>>> eval as eval(Row)。Using it like this: table.select(udf (*)). All we > >> also > >>>> need to consider if we put the columns package as a row. In a scenario > >>> like > >>>> this, we have Classification it as cloumn operation, and list the > >>> changes > >>>> to the column operation after the map/flatMap/agg/flatAgg phase is > >>>> completed. And Currently, Xiaowei has started a threading outlining > >>> which > >>>> talk about what we are proposing. Please see the detail in the mail > >>> thread: > >>>> Please see the detail in the mail thread: > >>>> > >>>> > >>> > >> > https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB > >>>> < > >>>> > >>> > >> > https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB > >>>>> > >>>> . 
> >>>> > >>>> At this stage, the Table API Enhancement Outline is as follows: > >>>> > >>>> > >>> > >> > https://docs.google.com/document/d/1tnpxg31EQz2-MEzSotwFzqatsB4rNLz0I-l_vPa5H4Q/edit?usp=sharing > >>>> > >>>> Please let us know if you have further thoughts or feedback! > >>>> > >>>> Thanks, > >>>> Jincheng > >>>> > >>>> > >>>> Jark Wu <[hidden email]> 于2018年11月6日周二 下午3:35写道: > >>>>> Hi jincheng, > >>>>> > >>>>> Thanks for your proposal. I think it is a helpful enhancement for TableAPI > >>>>> and a solid step forward for TableAPI. > >>>>> It doesn't weaken SQL or DataStream, because the conversion between > >>>>> DataStream and Table still works. > >>>>> People with advanced cases (e.g. complex and fine-grained state control) > >>>>> can go with DataStream, > >>>>> but most general cases can stay in TableAPI. This work aims to extend > >>>>> the functionality of TableAPI, > >>>>> to extend its usage scenarios, and to help TableAPI become a more widely used > >>>>> API. > >>>>> > >>>>> For example, someone wants to drop one column from a 100-column Table. > >>>>> Currently, we have to convert the > >>>>> Table to a DataStream and use a MapFunction to do that, or select the remaining > >>>>> 99 columns using the Table.select API. > >>>>> But if we support a Table.drop() method for TableAPI, it will be a very > >>>>> convenient method and let users stay in Table. > >>>>> > >>>>> Looking forward to the more detailed design and further discussion. > >>>>> > >>>>> Regards, > >>>>> Jark > >>>>> > >>>>> jincheng sun <[hidden email]> 于2018年11月6日周二 下午1:05写道: > >>>>> > >>>>>> Hi Rong Rong, > >>>>>> > >>>>>> Sorry for the late reply, and thanks for your feedback! We will continue > >>>>>> to add more convenience features to the TableAPI, such as map, flatmap, > >>>>>> agg, flatagg, iteration etc. And I am very happy that you are interested in > >>>>>> this proposal.
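Jark's drop() example is easy to model: with only select, removing one column from a 100-column table means naming the other 99, while a drop method names only the column to remove. A toy Python sketch follows (rows modeled as dicts; `select` and `drop` here are stand-ins for the Table API methods, not real Flink code):

```python
cols = [f"col{i}" for i in range(100)]
table = [dict.fromkeys(cols, 0)]  # toy table: one all-zero row

def select(table, *names):
    """Keep only the named columns (today's workaround)."""
    return [{n: row[n] for n in names} for row in table]

def drop(table, *names):
    """Remove the named columns (the proposed convenience method)."""
    return [{k: v for k, v in row.items() if k not in names} for row in table]

# Today: to drop col7, enumerate the 99 surviving columns.
via_select = select(table, *[c for c in cols if c != "col7"])

# Proposed: name only the column to remove.
via_drop = drop(table, "col7")

assert via_select == via_drop
assert "col7" not in via_drop[0] and len(via_drop[0]) == 99
```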
Due to this is a long-term continuous work, we will > >>> push > >>>>> it > >>>>>> in stages. Currently Xiaowei has started a threading outlining > >>> which > >>>>> talk > >>>>>> about what we are proposing. Please see the detail in the mail > >>> thread: > >>>>>> > >>>>>> > >>>>> > >>>> > >>> > >> > https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB > >>>>>> < > >>>>>> > >>>>> > >>>> > >>> > >> > https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB > >>>>>>> > >>>>>> . > >>>>>> > >>>>>> The Table API Enhancement Outline as follows: > >>>>>> > >>>>>> > >>>>> > >>>> > >>> > >> > https://docs.google.com/document/d/1tnpxg31EQz2-MEzSotwFzqatsB4rNLz0I-l_vPa5H4Q/edit?usp=sharing > >>>>>> > >>>>>> Please let we know if you have further thoughts or feedback! > >>>>>> > >>>>>> Thanks, > >>>>>> Jincheng > >>>>>> > >>>>>> Fabian Hueske <[hidden email]> 于2018年11月5日周一 下午7:03写道: > >>>>>> > >>>>>>> Hi Jincheng, > >>>>>>> > >>>>>>> Thanks for this interesting proposal. > >>>>>>> I like that we can push this effort forward in a very > >> fine-grained > >>>>>> manner, > >>>>>>> i.e., incrementally adding more APIs to the Table API. > >>>>>>> > >>>>>>> However, I also have a few questions / concerns. > >>>>>>> Today, the Table API is tightly integrated with the DataSet and > >>>>>> DataStream > >>>>>>> APIs. It is very easy to convert a Table into a DataSet or > >>> DataStream > >>>>> and > >>>>>>> vice versa. This mean it is already easy to combine custom logic > >> an > >>>>>>> relational operations. What I like is that several aspects are > >>>> clearly > >>>>>>> separated like retraction and timestamp handling (see below) + > >> all > >>>>>>> libraries on DataStream/DataSet can be easily combined with > >>>> relational > >>>>>>> operations. > >>>>>>> I can see that adding more functionality to the Table API would > >>>> remove > >>>>>> the > >>>>>>> distinction between DataSet and DataStream. 
However, wouldn't we > >>> get > >>>> a > >>>>>>> similar benefit by extending the DataStream API for proper > >> support > >>>> for > >>>>>>> bounded streams (as is the long-term goal of Flink)? > >>>>>>> I'm also a bit skeptical about the optimization opportunities we > >>>> would > >>>>>>> gain. Map/FlatMap UDFs are black boxes that cannot be easily > >>> removed > >>>>>>> without additional information (I did some research on this a few > >>>> years > >>>>>> ago > >>>>>>> [1]). > >>>>>>> > >>>>>>> Moreover, I think there are a few tricky details that need to be > >>>>> resolved > >>>>>>> to enable a good integration. > >>>>>>> > >>>>>>> 1) How to deal with retraction messages? The DataStream API does > >>> not > >>>>>> have a > >>>>>>> notion of retractions. How would a MapFunction or FlatMapFunction > >>>>> handle > >>>>>>> retraction? Do they need to be aware of the change flag? Custom > >>>>> windowing > >>>>>>> and aggregation logic would certainly need to have that > >>> information. > >>>>>>> 2) How to deal with timestamps? The DataStream API does not give > >>>> access > >>>>>> to > >>>>>>> timestamps. In the Table API / SQL these are exposed as regular > >>>>>> attributes. > >>>>>>> How can we ensure that timestamp attributes remain valid (i.e. > >>>> aligned > >>>>>> with > >>>>>>> watermarks) if the output is produced by arbitrary code? > >>>>>>> There might be more issues of this kind. > >>>>>>> > >>>>>>> My main question would be how much would we gain with this > >> proposal > >>>>> over > >>>>>> a > >>>>>>> tight integration of Table API and DataStream API, assuming that > >>>> batch > >>>>>>> functionality is moved to DataStream? > >>>>>>> > >>>>>>> Best, Fabian > >>>>>>> > >>>>>>> [1] > >>> http://stratosphere.eu/assets/papers/openingTheBlackBoxes_12.pdf > >>>>>>> > >>>>>>> > >>>>>>> Am Mo., 5. Nov. 
2018 um 02:49 Uhr schrieb Rong Rong < > >>>>> [hidden email] > >>>>>>> : > >>>>>>> > >>>>>>>> Hi Jincheng, > >>>>>>>> > >>>>>>>> Thank you for the proposal! I think being able to define a > >>> process > >>>> / > >>>>>>>> co-process function in table API definitely opens up a whole > >> new > >>>>> level > >>>>>> of > >>>>>>>> applications using a unified API. > >>>>>>>> > >>>>>>>> In addition, as Tzu-Li and Hequn have mentioned, the benefit of > >>>>>>>> optimization layer of Table API will already bring in > >> additional > >>>>>> benefit > >>>>>>>> over directly programming on top of DataStream/DataSet API. I > >> am > >>>> very > >>>>>>>> interested an looking forward to seeing the support for more > >>>> complex > >>>>>> use > >>>>>>>> cases, especially iterations. It will enable table API to > >> define > >>>> much > >>>>>>>> broader, event-driven use cases such as real-time ML > >>>>>> prediction/training. > >>>>>>>> > >>>>>>>> As Timo mentioned, This will make Table API diverge from the > >> SQL > >>>> API. > >>>>>> But > >>>>>>>> as from my experience Table API was always giving me the > >>> impression > >>>>> to > >>>>>>> be a > >>>>>>>> more sophisticated, syntactic-aware way to express relational > >>>>>> operations. > >>>>>>>> Looking forward to further discussion and collaborations on the > >>>> FLIP > >>>>>> doc. > >>>>>>>> > >>>>>>>> -- > >>>>>>>> Rong > >>>>>>>> > >>>>>>>> On Sun, Nov 4, 2018 at 5:22 PM jincheng sun < > >>>>> [hidden email]> > >>>>>>>> wrote: > >>>>>>>> > >>>>>>>>> Hi tison, > >>>>>>>>> > >>>>>>>>> Thanks a lot for your feedback! > >>>>>>>>> I am very happy to see that community contributors agree to > >>>>> enhanced > >>>>>>> the > >>>>>>>>> TableAPI. This work is a long-term continuous work, we will > >>> push > >>>> it > >>>>>> in > >>>>>>>>> stages, we will soon complete the enhanced list of the first > >>>>> phase, > >>>>>> we > >>>>>>>> can > >>>>>>>>> go deep discussion in google doc. 
thanks again for joining > >> on > >>>> the > >>>>>> very > >>>>>>>>> important discussion of the Flink Table API. > >>>>>>>>> > >>>>>>>>> Thanks, > >>>>>>>>> Jincheng > >>>>>>>>> > >>>>>>>>> Tzu-Li Chen <[hidden email]> 于2018年11月2日周五 下午1:49写道: > >>>>>>>>> > >>>>>>>>>> Hi jingchengm > >>>>>>>>>> > >>>>>>>>>> Thanks a lot for your proposal! I find it is a good start > >>> point > >>>>> for > >>>>>>>>>> internal optimization works and help Flink to be more > >>>>>>>>>> user-friendly. > >>>>>>>>>> > >>>>>>>>>> AFAIK, DataStream is the most popular API currently that > >>> Flink > >>>>>>>>>> users should describe their logic with detailed logic. > >>>>>>>>>> From a more internal view the conversion from DataStream to > >>>>>>>>>> JobGraph is quite mechanically and hard to be optimized. So > >>>> when > >>>>>>>>>> users program with DataStream, they have to learn more > >>>> internals > >>>>>>>>>> and spend a lot of time to tune for performance. > >>>>>>>>>> With your proposal, we provide enhanced functionality of > >>> Table > >>>>> API, > >>>>>>>>>> so that users can describe their job easily on Table > >> aspect. > >>>> This > >>>>>>> gives > >>>>>>>>>> an opportunity to Flink developers to introduce an optimize > >>>> phase > >>>>>>>>>> while transforming user program(described by Table API) to > >>>>> internal > >>>>>>>>>> representation. > >>>>>>>>>> > >>>>>>>>>> Given a user who want to start using Flink with simple ETL, > >>>>>>> pipelining > >>>>>>>>>> or analytics, he would find it is most naturally described > >> by > >>>>>>> SQL/Table > >>>>>>>>>> API. Further, as mentioned by @hequn, > >>>>>>>>>> > >>>>>>>>>> SQL is a widely used language. It follows standards, is a > >>>>>>>>>>> descriptive language, and is easy to use > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> thus we could expect with the enhancement of SQL/Table API, > >>>> Flink > >>>>>>>>>> becomes more friendly to users. > >>>>>>>>>> > >>>>>>>>>> Looking forward to the design doc/FLIP! 
> >>>>>>>>>> > >>>>>>>>>> Best, > >>>>>>>>>> tison. > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> jincheng sun <[hidden email]> 于2018年11月2日周五 > >>>> 上午11:46写道: > >>>>>>>>>> > >>>>>>>>>>> Hi Hequn, > >>>>>>>>>>> Thanks for your feedback! And also thanks for our offline > >>>>>>> discussion! > >>>>>>>>>>> You are right, unification of batch and streaming is very > >>>>>> important > >>>>>>>> for > >>>>>>>>>>> flink API. > >>>>>>>>>>> We will provide more detailed design later, Please let me > >>>> know > >>>>> if > >>>>>>> you > >>>>>>>>>> have > >>>>>>>>>>> further thoughts or feedback. > >>>>>>>>>>> > >>>>>>>>>>> Thanks, > >>>>>>>>>>> Jincheng > >>>>>>>>>>> > >>>>>>>>>>> Hequn Cheng <[hidden email]> 于2018年11月2日周五 > >>> 上午10:02写道: > >>>>>>>>>>> > >>>>>>>>>>>> Hi Jincheng, > >>>>>>>>>>>> > >>>>>>>>>>>> Thanks a lot for your proposal. It is very encouraging! > >>>>>>>>>>>> > >>>>>>>>>>>> As we all know, SQL is a widely used language. It > >> follows > >>>>>>>> standards, > >>>>>>>>>> is a > >>>>>>>>>>>> descriptive language, and is easy to use. A powerful > >>>> feature > >>>>> of > >>>>>>> SQL > >>>>>>>>> is > >>>>>>>>>>> that > >>>>>>>>>>>> it supports optimization. Users only need to care about > >>> the > >>>>>> logic > >>>>>>>> of > >>>>>>>>>> the > >>>>>>>>>>>> program. The underlying optimizer will help users > >>> optimize > >>>>> the > >>>>>>>>>>> performance > >>>>>>>>>>>> of the program. However, in terms of functionality and > >>> ease > >>>>> of > >>>>>>> use, > >>>>>>>>> in > >>>>>>>>>>> some > >>>>>>>>>>>> scenarios sql will be limited, as described in > >> Jincheng's > >>>>>>> proposal. > >>>>>>>>>>>> > >>>>>>>>>>>> Correspondingly, the DataStream/DataSet api can provide > >>>>>> powerful > >>>>>>>>>>>> functionalities. Users can write > >>>>>>> ProcessFunction/CoProcessFunction > >>>>>>>>> and > >>>>>>>>>>> get > >>>>>>>>>>>> the timer. Compared with SQL, it provides more > >>>>> functionalities > >>>>>>> and > >>>>>>>>>>>> flexibilities. 
However, it does not support > >> optimization > >>>> like > >>>>>>> SQL. > >>>>>>>>>>>> Meanwhile, DataStream/DataSet api has not been unified > >>>> which > >>>>>>> means, > >>>>>>>>> for > >>>>>>>>>>> the > >>>>>>>>>>>> same logic, users need to write a job for each stream > >> and > >>>>>> batch. > >>>>>>>>>>>> > >>>>>>>>>>>> With TableApi, I think we can combine the advantages of > >>>> both. > >>>>>>> Users > >>>>>>>>> can > >>>>>>>>>>>> easily write relational operations and enjoy > >>> optimization. > >>>> At > >>>>>> the > >>>>>>>>> same > >>>>>>>>>>>> time, it supports more functionality and ease of use. > >>>> Looking > >>>>>>>> forward > >>>>>>>>>> to > >>>>>>>>>>>> the detailed design/FLIP. > >>>>>>>>>>>> > >>>>>>>>>>>> Best, > >>>>>>>>>>>> Hequn > >>>>>>>>>>>> > >>>>>>>>>>>> On Fri, Nov 2, 2018 at 9:48 AM Shaoxuan Wang < > >>>>>>> [hidden email]> > >>>>>>>>>>> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>>> Hi Aljoscha, > >>>>>>>>>>>>> Glad that you like the proposal. We have completed > >> the > >>>>>>> prototype > >>>>>>>> of > >>>>>>>>>>> most > >>>>>>>>>>>>> new proposed functionalities. Once collect the > >> feedback > >>>>> from > >>>>>>>>>> community, > >>>>>>>>>>>> we > >>>>>>>>>>>>> will come up with a concrete FLIP/design doc. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Regards, > >>>>>>>>>>>>> Shaoxuan > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> On Thu, Nov 1, 2018 at 8:12 PM Aljoscha Krettek < > >>>>>>>>> [hidden email] > >>>>>>>>>>> > >>>>>>>>>>>>> wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>>> Hi Jincheng, > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> these points sound very good! Are there any > >> concrete > >>>>>>> proposals > >>>>>>>>> for > >>>>>>>>>>>>>> changes? For example a FLIP/design document? 
> >>>>>>>>>>>>>> > >>>>>>>>>>>>>> See here for FLIPs: > >>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>> > >>>>> > >>>> > >>> > >> > https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Best, > >>>>>>>>>>>>>> Aljoscha > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> On 1. Nov 2018, at 12:51, jincheng sun < > >>>>>>>>> [hidden email] > >>>>>>>>>>> > >>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> *--------I am sorry for the formatting of the > >>>>>>> content. > >>>>>>>> I > >>>>>>>>>>>> reformat > >>>>>>>>>>>>>>> the **content** as follows-----------* > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> *Hi ALL,* > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> With the continuous efforts from the community, > >> the > >>>>> Flink > >>>>>>>>> system > >>>>>>>>>>> has > >>>>>>>>>>>>> been > >>>>>>>>>>>>>>> continuously improved, which has attracted more > >> and > >>>>> more > >>>>>>>> users. > >>>>>>>>>>> Flink > >>>>>>>>>>>>> SQL > >>>>>>>>>>>>>>> is a canonical, widely used relational query > >>>> language. > >>>>>>>> However, > >>>>>>>>>>> there > >>>>>>>>>>>>> are > >>>>>>>>>>>>>>> still some scenarios where Flink SQL failed to > >> meet > >>>>> user > >>>>>>>> needs > >>>>>>>>> in > >>>>>>>>>>>> terms > >>>>>>>>>>>>>> of > >>>>>>>>>>>>>>> functionality and ease of use, such as: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> *1. In terms of functionality* > >>>>>>>>>>>>>>> Iteration, user-defined window, user-defined > >>> join, > >>>>>>>>>> user-defined > >>>>>>>>>>>>>>> GroupReduce, etc. Users cannot express them with > >>> SQL; > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> *2. In terms of ease of use* > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> - Map - e.g. “dataStream.map(mapFun)”. 
Although > >>>>>>>>>>>> “table.select(udf1(), > >>>>>>>>>>>>>>> udf2(), udf3()....)” can be used to accomplish > >>> the > >>>>> same > >>>>>>>>>>> function., > >>>>>>>>>>>>>> with a > >>>>>>>>>>>>>>> map() function returning 100 columns, one has > >> to > >>>>> define > >>>>>>> or > >>>>>>>>> call > >>>>>>>>>>> 100 > >>>>>>>>>>>>>> UDFs > >>>>>>>>>>>>>>> when using SQL, which is quite involved. > >>>>>>>>>>>>>>> - FlatMap - e.g. > >>> “dataStrem.flatmap(flatMapFun)”. > >>>>>>>> Similarly, > >>>>>>>>>> it > >>>>>>>>>>>> can > >>>>>>>>>>>>> be > >>>>>>>>>>>>>>> implemented with “table.join(udtf).select()”. > >>>>> However, > >>>>>> it > >>>>>>>> is > >>>>>>>>>>>> obvious > >>>>>>>>>>>>>> that > >>>>>>>>>>>>>>> dataStream is easier to use than SQL. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Due to the above two reasons, some users have to > >>> use > >>>>> the > >>>>>>>>>> DataStream > >>>>>>>>>>>> API > >>>>>>>>>>>>>> or > >>>>>>>>>>>>>>> the DataSet API. But when they do that, they lose > >>> the > >>>>>>>>> unification > >>>>>>>>>>> of > >>>>>>>>>>>>>> batch > >>>>>>>>>>>>>>> and streaming. They will also lose the > >>> sophisticated > >>>>>>>>>> optimizations > >>>>>>>>>>>> such > >>>>>>>>>>>>>> as > >>>>>>>>>>>>>>> codegen, aggregate join transpose and multi-stage > >>> agg > >>>>>> from > >>>>>>>>> Flink > >>>>>>>>>>> SQL. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> We believe that enhancing the functionality and > >>>>>>> productivity > >>>>>>>> is > >>>>>>>>>>> vital > >>>>>>>>>>>>> for > >>>>>>>>>>>>>>> the successful adoption of Table API. To this > >> end, > >>>>> Table > >>>>>>> API > >>>>>>>>>> still > >>>>>>>>>>>>>>> requires more efforts from every contributor in > >> the > >>>>>>>> community. > >>>>>>>>> We > >>>>>>>>>>> see > >>>>>>>>>>>>>> great > >>>>>>>>>>>>>>> opportunity in improving our user’s experience > >> from > >>>>> this > >>>>>>>> work. > >>>>>>>>>> Any > >>>>>>>>>>>>>> feedback > >>>>>>>>>>>>>>> is welcome. 
> >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Regards, > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Jincheng > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> jincheng sun <[hidden email]> > >>>> 于2018年11月1日周四 > >>>>>>>>> 下午5:07写道: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Hi all, > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> With the continuous efforts from the community, > >>> the > >>>>>> Flink > >>>>>>>>> system > >>>>>>>>>>> has > >>>>>>>>>>>>>> been > >>>>>>>>>>>>>>>> continuously improved, which has attracted more > >>> and > >>>>> more > >>>>>>>>> users. > >>>>>>>>>>>> Flink > >>>>>>>>>>>>>> SQL > >>>>>>>>>>>>>>>> is a canonical, widely used relational query > >>>> language. > >>>>>>>>> However, > >>>>>>>>>>>> there > >>>>>>>>>>>>>> are > >>>>>>>>>>>>>>>> still some scenarios where Flink SQL failed to > >>> meet > >>>>> user > >>>>>>>> needs > >>>>>>>>>> in > >>>>>>>>>>>>> terms > >>>>>>>>>>>>>> of > >>>>>>>>>>>>>>>> functionality and ease of use, such as: > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> - > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> In terms of functionality > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Iteration, user-defined window, user-defined > >> join, > >>>>>>>>> user-defined > >>>>>>>>>>>>>>>> GroupReduce, etc. Users cannot express them with > >>>> SQL; > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> - > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> In terms of ease of use > >>>>>>>>>>>>>>>> - > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Map - e.g. “dataStream.map(mapFun)”. > >> Although > >>>>>>>>>>>>> “table.select(udf1(), > >>>>>>>>>>>>>>>> udf2(), udf3()....)” can be used to > >>> accomplish > >>>>> the > >>>>>>> same > >>>>>>>>>>>>> function., > >>>>>>>>>>>>>> with a > >>>>>>>>>>>>>>>> map() function returning 100 columns, one > >> has > >>>> to > >>>>>>> define > >>>>>>>>> or > >>>>>>>>>>> call > >>>>>>>>>>>>>> 100 UDFs > >>>>>>>>>>>>>>>> when using SQL, which is quite involved. > >>>>>>>>>>>>>>>> - > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> FlatMap - e.g. > >>>> “dataStrem.flatmap(flatMapFun)”. 
> >>>>>>>>> Similarly, > >>>>>>>>>>> it > >>>>>>>>>>>>> can > >>>>>>>>>>>>>>>> be implemented with > >>>> “table.join(udtf).select()”. > >>>>>>>> However, > >>>>>>>>>> it > >>>>>>>>>>> is > >>>>>>>>>>>>>> obvious > >>>>>>>>>>>>>>>> that datastream is easier to use than SQL. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Due to the above two reasons, some users have to > >>> use > >>>>> the > >>>>>>>>>>> DataStream > >>>>>>>>>>>>> API > >>>>>>>>>>>>>> or > >>>>>>>>>>>>>>>> the DataSet API. But when they do that, they > >> lose > >>>> the > >>>>>>>>>> unification > >>>>>>>>>>> of > >>>>>>>>>>>>>> batch > >>>>>>>>>>>>>>>> and streaming. They will also lose the > >>> sophisticated > >>>>>>>>>> optimizations > >>>>>>>>>>>>> such > >>>>>>>>>>>>>> as > >>>>>>>>>>>>>>>> codegen, aggregate join transpose and > >> multi-stage > >>>> agg > >>>>>>> from > >>>>>>>>>> Flink > >>>>>>>>>>>> SQL. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> We believe that enhancing the functionality and > >>>>>>> productivity > >>>>>>>>> is > >>>>>>>>>>>> vital > >>>>>>>>>>>>>> for > >>>>>>>>>>>>>>>> the successful adoption of Table API. To this > >> end, > >>>>>> Table > >>>>>>>> API > >>>>>>>>>>> still > >>>>>>>>>>>>>>>> requires more efforts from every contributor in > >>> the > >>>>>>>> community. > >>>>>>>>>> We > >>>>>>>>>>>> see > >>>>>>>>>>>>>> great > >>>>>>>>>>>>>>>> opportunity in improving our user’s experience > >>> from > >>>>> this > >>>>>>>> work. > >>>>>>>>>> Any > >>>>>>>>>>>>>> feedback > >>>>>>>>>>>>>>>> is welcome. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Regards, > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Jincheng > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>> > >>>>> > >>>> > >>> > >> > > |
In reply to this post by Piotr Nowojski
Hi all,
Thanks for the feedback. I enjoyed the discussions, especially the ones between Fabian and Xiaowei. I think they well revealed the motivations and design pros/cons behind this proposal. Enhancing the TableAPI will not affect or limit the improvements on Flink SQL (as well as DataStream). Actually, Alibaba is the biggest fan of Flink SQL, and our contributions to Flink SQL will not cease or be reduced because of this. As a supplement to the motivations, I would like to share some of our experience and lessons learned at Alibaba. Over the past 1-2 years, we rebuilt most of our production jobs (including data analysis as well as AI pipelines) on top of the SQL API. Besides SQL, we also provide the TableAPI to our users and have received a lot of interest and new requests. We extended the TableAPI with some new functionalities, and it indeed helped a lot (in terms of ease of use as well as performance) in many cases. This motivates us to contribute it back to the community. Xiaowei opened another email thread and listed a few things that we are proposing to add to the TableAPI as the first step. Please take a look. Let us move the discussions on the design proposal to that mailing list thread.
Thanks, Shaoxuan
On Tue, Nov 6, 2018 at 9:24 PM Piotr Nowojski <[hidden email]> wrote: > Hi, > > What is our intended division/border between the Table API and DataSet or > DataStream? If we want the Table API to drift away from SQL, that would be a > valid question. > > > Another distinguishing feature of the DataStream API is that users get direct > > access to state/statebackends, which we intentionally avoided in the Table API > > Do we really want to make the Table API an equivalent of the DataSet/DataStream > API but without state? Drawing the boundary in such a way would make it more > difficult for users to pick the right tool if for many use cases they could > use both. What if at some point they come to the conclusion that > they need some small state access for something?
If that’s our intended end > goal (a separation between the Table API and DataStream API), it would be very > weird having two very similar APIs that have tons of small differences > but are basically equivalent modulo state access. > > Maybe instead of duplicating efforts and work between our different APIs, > we should focus more on either interoperability or unifying them? > > For example, if we would like to differentiate those APIs because of the > presence/lack of an optimiser, maybe the APIs should be the same, but there > should be a way to tell whether a UDF/operator is deterministic, has side > effects, etc. And if such an operator is found in the plan, the nodes below > and above could still be subject to regular optimisation rules. > > Piotrek > > > On 6 Nov 2018, at 14:04, Fabian Hueske <[hidden email]> wrote: > > > > Hi, > > > > An analysis of orthogonal functions would be great! > > There is certainly some overlap in the functions provided by the DataSet > > API. > > > > In the past, I found that having low-level functions helped a lot to > > efficiently implement complex logic. > > Without partitionByHash, sortPartition, sort, mapPartition, combine, etc., it > > would not be possible to (efficiently) implement certain operators for the > > Table API, SQL, or the Cascading-On-Flink port that I did a while back. > > I could imagine that these APIs would be useful to implement DSLs on top of > > the Table API, such as Gelly. > > > > Anyway, I completely agree that these physical operators should not be the > > first step. > > If we find that these methods are not needed, even better! > > > > Let's try to keep this thread focused on the general proposal of extending > > the scope of the Table API and keep the discussion of the concrete proposal > > that Xiaowei shared in the other thread (and the design doc). > > That will help to keep all related comments in one place ;-) > > > > Best, Fabian > > > > > > Am Di., 6. Nov.
2018 um 13:01 Uhr schrieb jincheng sun < > > [hidden email]>: > > > >> Hi Fabian, > >> Thank you for your deep thoughts in this regard. I think most of the questions > >> you mentioned are very worthy of in-depth discussion! I want to share my > >> thoughts on the following questions: > >> > >> 1. Do we need to move all DataSet API functionality into the Table API? > >> I think most of the DataSet functionality should be added to the TableAPI, such > >> as map, flatmap, groupReduce etc., because these are very easy for > >> users to use. > >> > >> 2. Do we support explicit physical operations like partitioning, sorting or > >> optimizer hints? > >> I think we do not want to add the physical operations, e.g. > >> sortPartition, partitionCustom, etc. From my point of view, those > >> physical operations are used for optimization, which can be solved by > >> hints (I think we should add a hints feature to both the TableAPI and SQL). > >> > >> 3. Do we want to support retractions in iteration? > >> I think supporting iteration is very complicated. I am not sure, > >> but I think the iteration may be implemented > >> according to the current batch mode, with retraction temporarily not > >> supported, assuming that the training data will not be updated in the > >> current iteration. The updated data would be used in the next iteration. So > >> I think we should have an in-depth discussion in a new thread. > >> > >> BTW, I find that you have left very useful comments in the > >> doc: > >> > >> > https://docs.google.com/document/d/1tnpxg31EQz2-MEzSotwFzqatsB4rNLz0I-l_vPa5H4Q/edit# > >> > >> Thanks again for both your mail feedback and doc comments! > >> > >> Best, > >> Jincheng > >> > >> > >> > >> Fabian Hueske <[hidden email]> 于2018年11月6日周二 下午6:21写道: > >> > >>> Thanks for the replies Xiaowei and others!
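Both the hints idea above and Piotr's earlier suggestion of declaring whether a UDF is deterministic amount to the same thing: giving the planner just enough information to optimize around black boxes. A toy Python sketch of this (a hypothetical mini-planner, not Flink or Calcite code; all names are made up) shows a filter being pushed below a map only when the map is flagged as deterministic:

```python
def deterministic(fn):
    """Mark a user function as deterministic and side-effect free."""
    fn.deterministic = True
    return fn

def run(plan, data):
    """Execute a list of ("map", f) / ("filter", p) stages over a list."""
    for op, f in plan:
        data = [f(x) for x in data] if op == "map" else [x for x in data if f(x)]
    return data

def push_filter_below_map(plan):
    """Rewrite map->filter into filter->map when the map is declared
    deterministic: the pushed predicate re-evaluates the map on its input,
    which is only safe for deterministic, side-effect-free functions."""
    out = list(plan)
    for i in range(len(out) - 1):
        (op1, m), (op2, p) = out[i], out[i + 1]
        if op1 == "map" and op2 == "filter" and getattr(m, "deterministic", False):
            out[i] = ("filter", lambda x, m=m, p=p: p(m(x)))
            out[i + 1] = ("map", m)
    return out

@deterministic
def double(x):
    return 2 * x

plan = [("map", double), ("filter", lambda y: y > 4)]
optimized = push_filter_below_map(plan)

assert optimized[0][0] == "filter"  # the rule fired on the flagged UDF
assert run(plan, [1, 2, 3]) == run(optimized, [1, 2, 3]) == [6]
```

An unflagged map would be left untouched, which is exactly the black-box problem Fabian described.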
> >>> [The remainder of this message quotes earlier emails in the thread verbatim; the full texts appear above and are trimmed here.]
I think being able to define a > >>> process > >>>> / > >>>>>>>> co-process function in table API definitely opens up a whole > >> new > >>>>> level > >>>>>> of > >>>>>>>> applications using a unified API. > >>>>>>>> > >>>>>>>> In addition, as Tzu-Li and Hequn have mentioned, the benefit of > >>>>>>>> optimization layer of Table API will already bring in > >> additional > >>>>>> benefit > >>>>>>>> over directly programming on top of DataStream/DataSet API. I > >> am > >>>> very > >>>>>>>> interested an looking forward to seeing the support for more > >>>> complex > >>>>>> use > >>>>>>>> cases, especially iterations. It will enable table API to > >> define > >>>> much > >>>>>>>> broader, event-driven use cases such as real-time ML > >>>>>> prediction/training. > >>>>>>>> > >>>>>>>> As Timo mentioned, This will make Table API diverge from the > >> SQL > >>>> API. > >>>>>> But > >>>>>>>> as from my experience Table API was always giving me the > >>> impression > >>>>> to > >>>>>>> be a > >>>>>>>> more sophisticated, syntactic-aware way to express relational > >>>>>> operations. > >>>>>>>> Looking forward to further discussion and collaborations on the > >>>> FLIP > >>>>>> doc. > >>>>>>>> > >>>>>>>> -- > >>>>>>>> Rong > >>>>>>>> > >>>>>>>> On Sun, Nov 4, 2018 at 5:22 PM jincheng sun < > >>>>> [hidden email]> > >>>>>>>> wrote: > >>>>>>>> > >>>>>>>>> Hi tison, > >>>>>>>>> > >>>>>>>>> Thanks a lot for your feedback! > >>>>>>>>> I am very happy to see that community contributors agree to > >>>>> enhanced > >>>>>>> the > >>>>>>>>> TableAPI. This work is a long-term continuous work, we will > >>> push > >>>> it > >>>>>> in > >>>>>>>>> stages, we will soon complete the enhanced list of the first > >>>>> phase, > >>>>>> we > >>>>>>>> can > >>>>>>>>> go deep discussion in google doc. thanks again for joining > >> on > >>>> the > >>>>>> very > >>>>>>>>> important discussion of the Flink Table API. 
> >>>>>>>>> > >>>>>>>>> Thanks, > >>>>>>>>> Jincheng > >>>>>>>>> > >>>>>>>>> Tzu-Li Chen <[hidden email]> 于2018年11月2日周五 下午1:49写道: > >>>>>>>>> > >>>>>>>>>> Hi jingchengm > >>>>>>>>>> > >>>>>>>>>> Thanks a lot for your proposal! I find it is a good start > >>> point > >>>>> for > >>>>>>>>>> internal optimization works and help Flink to be more > >>>>>>>>>> user-friendly. > >>>>>>>>>> > >>>>>>>>>> AFAIK, DataStream is the most popular API currently that > >>> Flink > >>>>>>>>>> users should describe their logic with detailed logic. > >>>>>>>>>> From a more internal view the conversion from DataStream to > >>>>>>>>>> JobGraph is quite mechanically and hard to be optimized. So > >>>> when > >>>>>>>>>> users program with DataStream, they have to learn more > >>>> internals > >>>>>>>>>> and spend a lot of time to tune for performance. > >>>>>>>>>> With your proposal, we provide enhanced functionality of > >>> Table > >>>>> API, > >>>>>>>>>> so that users can describe their job easily on Table > >> aspect. > >>>> This > >>>>>>> gives > >>>>>>>>>> an opportunity to Flink developers to introduce an optimize > >>>> phase > >>>>>>>>>> while transforming user program(described by Table API) to > >>>>> internal > >>>>>>>>>> representation. > >>>>>>>>>> > >>>>>>>>>> Given a user who want to start using Flink with simple ETL, > >>>>>>> pipelining > >>>>>>>>>> or analytics, he would find it is most naturally described > >> by > >>>>>>> SQL/Table > >>>>>>>>>> API. Further, as mentioned by @hequn, > >>>>>>>>>> > >>>>>>>>>> SQL is a widely used language. It follows standards, is a > >>>>>>>>>>> descriptive language, and is easy to use > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> thus we could expect with the enhancement of SQL/Table API, > >>>> Flink > >>>>>>>>>> becomes more friendly to users. > >>>>>>>>>> > >>>>>>>>>> Looking forward to the design doc/FLIP! > >>>>>>>>>> > >>>>>>>>>> Best, > >>>>>>>>>> tison. 
> >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> jincheng sun <[hidden email]> 于2018年11月2日周五 > >>>> 上午11:46写道: > >>>>>>>>>> > >>>>>>>>>>> Hi Hequn, > >>>>>>>>>>> Thanks for your feedback! And also thanks for our offline > >>>>>>> discussion! > >>>>>>>>>>> You are right, unification of batch and streaming is very > >>>>>> important > >>>>>>>> for > >>>>>>>>>>> flink API. > >>>>>>>>>>> We will provide more detailed design later, Please let me > >>>> know > >>>>> if > >>>>>>> you > >>>>>>>>>> have > >>>>>>>>>>> further thoughts or feedback. > >>>>>>>>>>> > >>>>>>>>>>> Thanks, > >>>>>>>>>>> Jincheng > >>>>>>>>>>> > >>>>>>>>>>> Hequn Cheng <[hidden email]> 于2018年11月2日周五 > >>> 上午10:02写道: > >>>>>>>>>>> > >>>>>>>>>>>> Hi Jincheng, > >>>>>>>>>>>> > >>>>>>>>>>>> Thanks a lot for your proposal. It is very encouraging! > >>>>>>>>>>>> > >>>>>>>>>>>> As we all know, SQL is a widely used language. It > >> follows > >>>>>>>> standards, > >>>>>>>>>> is a > >>>>>>>>>>>> descriptive language, and is easy to use. A powerful > >>>> feature > >>>>> of > >>>>>>> SQL > >>>>>>>>> is > >>>>>>>>>>> that > >>>>>>>>>>>> it supports optimization. Users only need to care about > >>> the > >>>>>> logic > >>>>>>>> of > >>>>>>>>>> the > >>>>>>>>>>>> program. The underlying optimizer will help users > >>> optimize > >>>>> the > >>>>>>>>>>> performance > >>>>>>>>>>>> of the program. However, in terms of functionality and > >>> ease > >>>>> of > >>>>>>> use, > >>>>>>>>> in > >>>>>>>>>>> some > >>>>>>>>>>>> scenarios sql will be limited, as described in > >> Jincheng's > >>>>>>> proposal. > >>>>>>>>>>>> > >>>>>>>>>>>> Correspondingly, the DataStream/DataSet api can provide > >>>>>> powerful > >>>>>>>>>>>> functionalities. Users can write > >>>>>>> ProcessFunction/CoProcessFunction > >>>>>>>>> and > >>>>>>>>>>> get > >>>>>>>>>>>> the timer. Compared with SQL, it provides more > >>>>> functionalities > >>>>>>> and > >>>>>>>>>>>> flexibilities. 
However, it does not support > >> optimization > >>>> like > >>>>>>> SQL. > >>>>>>>>>>>> Meanwhile, DataStream/DataSet api has not been unified > >>>> which > >>>>>>> means, > >>>>>>>>> for > >>>>>>>>>>> the > >>>>>>>>>>>> same logic, users need to write a job for each stream > >> and > >>>>>> batch. > >>>>>>>>>>>> > >>>>>>>>>>>> With TableApi, I think we can combine the advantages of > >>>> both. > >>>>>>> Users > >>>>>>>>> can > >>>>>>>>>>>> easily write relational operations and enjoy > >>> optimization. > >>>> At > >>>>>> the > >>>>>>>>> same > >>>>>>>>>>>> time, it supports more functionality and ease of use. > >>>> Looking > >>>>>>>> forward > >>>>>>>>>> to > >>>>>>>>>>>> the detailed design/FLIP. > >>>>>>>>>>>> > >>>>>>>>>>>> Best, > >>>>>>>>>>>> Hequn > >>>>>>>>>>>> > >>>>>>>>>>>> On Fri, Nov 2, 2018 at 9:48 AM Shaoxuan Wang < > >>>>>>> [hidden email]> > >>>>>>>>>>> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>>> Hi Aljoscha, > >>>>>>>>>>>>> Glad that you like the proposal. We have completed > >> the > >>>>>>> prototype > >>>>>>>> of > >>>>>>>>>>> most > >>>>>>>>>>>>> new proposed functionalities. Once collect the > >> feedback > >>>>> from > >>>>>>>>>> community, > >>>>>>>>>>>> we > >>>>>>>>>>>>> will come up with a concrete FLIP/design doc. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Regards, > >>>>>>>>>>>>> Shaoxuan > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> On Thu, Nov 1, 2018 at 8:12 PM Aljoscha Krettek < > >>>>>>>>> [hidden email] > >>>>>>>>>>> > >>>>>>>>>>>>> wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>>> Hi Jincheng, > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> these points sound very good! Are there any > >> concrete > >>>>>>> proposals > >>>>>>>>> for > >>>>>>>>>>>>>> changes? For example a FLIP/design document? 
> >>>>>>>>>>>>>> > >>>>>>>>>>>>>> See here for FLIPs: > >>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>> > >>>>> > >>>> > >>> > >> > https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Best, > >>>>>>>>>>>>>> Aljoscha > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> On 1. Nov 2018, at 12:51, jincheng sun < > >>>>>>>>> [hidden email] > >>>>>>>>>>> > >>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> *--------I am sorry for the formatting of the > >>>>>>> content. > >>>>>>>> I > >>>>>>>>>>>> reformat > >>>>>>>>>>>>>>> the **content** as follows-----------* > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> *Hi ALL,* > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> With the continuous efforts from the community, > >> the > >>>>> Flink > >>>>>>>>> system > >>>>>>>>>>> has > >>>>>>>>>>>>> been > >>>>>>>>>>>>>>> continuously improved, which has attracted more > >> and > >>>>> more > >>>>>>>> users. > >>>>>>>>>>> Flink > >>>>>>>>>>>>> SQL > >>>>>>>>>>>>>>> is a canonical, widely used relational query > >>>> language. > >>>>>>>> However, > >>>>>>>>>>> there > >>>>>>>>>>>>> are > >>>>>>>>>>>>>>> still some scenarios where Flink SQL failed to > >> meet > >>>>> user > >>>>>>>> needs > >>>>>>>>> in > >>>>>>>>>>>> terms > >>>>>>>>>>>>>> of > >>>>>>>>>>>>>>> functionality and ease of use, such as: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> *1. In terms of functionality* > >>>>>>>>>>>>>>> Iteration, user-defined window, user-defined > >>> join, > >>>>>>>>>> user-defined > >>>>>>>>>>>>>>> GroupReduce, etc. Users cannot express them with > >>> SQL; > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> *2. In terms of ease of use* > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> - Map - e.g. “dataStream.map(mapFun)”. 
Although > >>>>>>>>>>>> “table.select(udf1(), > >>>>>>>>>>>>>>> udf2(), udf3()....)” can be used to accomplish > >>> the > >>>>> same > >>>>>>>>>>> function., > >>>>>>>>>>>>>> with a > >>>>>>>>>>>>>>> map() function returning 100 columns, one has > >> to > >>>>> define > >>>>>>> or > >>>>>>>>> call > >>>>>>>>>>> 100 > >>>>>>>>>>>>>> UDFs > >>>>>>>>>>>>>>> when using SQL, which is quite involved. > >>>>>>>>>>>>>>> - FlatMap - e.g. > >>> “dataStrem.flatmap(flatMapFun)”. > >>>>>>>> Similarly, > >>>>>>>>>> it > >>>>>>>>>>>> can > >>>>>>>>>>>>> be > >>>>>>>>>>>>>>> implemented with “table.join(udtf).select()”. > >>>>> However, > >>>>>> it > >>>>>>>> is > >>>>>>>>>>>> obvious > >>>>>>>>>>>>>> that > >>>>>>>>>>>>>>> dataStream is easier to use than SQL. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Due to the above two reasons, some users have to > >>> use > >>>>> the > >>>>>>>>>> DataStream > >>>>>>>>>>>> API > >>>>>>>>>>>>>> or > >>>>>>>>>>>>>>> the DataSet API. But when they do that, they lose > >>> the > >>>>>>>>> unification > >>>>>>>>>>> of > >>>>>>>>>>>>>> batch > >>>>>>>>>>>>>>> and streaming. They will also lose the > >>> sophisticated > >>>>>>>>>> optimizations > >>>>>>>>>>>> such > >>>>>>>>>>>>>> as > >>>>>>>>>>>>>>> codegen, aggregate join transpose and multi-stage > >>> agg > >>>>>> from > >>>>>>>>> Flink > >>>>>>>>>>> SQL. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> We believe that enhancing the functionality and > >>>>>>> productivity > >>>>>>>> is > >>>>>>>>>>> vital > >>>>>>>>>>>>> for > >>>>>>>>>>>>>>> the successful adoption of Table API. To this > >> end, > >>>>> Table > >>>>>>> API > >>>>>>>>>> still > >>>>>>>>>>>>>>> requires more efforts from every contributor in > >> the > >>>>>>>> community. > >>>>>>>>> We > >>>>>>>>>>> see > >>>>>>>>>>>>>> great > >>>>>>>>>>>>>>> opportunity in improving our user’s experience > >> from > >>>>> this > >>>>>>>> work. > >>>>>>>>>> Any > >>>>>>>>>>>>>> feedback > >>>>>>>>>>>>>>> is welcome. 
> >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Regards, > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Jincheng > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> jincheng sun <[hidden email]> > >>>> 于2018年11月1日周四 > >>>>>>>>> 下午5:07写道: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Hi all, > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> With the continuous efforts from the community, > >>> the > >>>>>> Flink > >>>>>>>>> system > >>>>>>>>>>> has > >>>>>>>>>>>>>> been > >>>>>>>>>>>>>>>> continuously improved, which has attracted more > >>> and > >>>>> more > >>>>>>>>> users. > >>>>>>>>>>>> Flink > >>>>>>>>>>>>>> SQL > >>>>>>>>>>>>>>>> is a canonical, widely used relational query > >>>> language. > >>>>>>>>> However, > >>>>>>>>>>>> there > >>>>>>>>>>>>>> are > >>>>>>>>>>>>>>>> still some scenarios where Flink SQL failed to > >>> meet > >>>>> user > >>>>>>>> needs > >>>>>>>>>> in > >>>>>>>>>>>>> terms > >>>>>>>>>>>>>> of > >>>>>>>>>>>>>>>> functionality and ease of use, such as: > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> - > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> In terms of functionality > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Iteration, user-defined window, user-defined > >> join, > >>>>>>>>> user-defined > >>>>>>>>>>>>>>>> GroupReduce, etc. Users cannot express them with > >>>> SQL; > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> - > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> In terms of ease of use > >>>>>>>>>>>>>>>> - > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Map - e.g. “dataStream.map(mapFun)”. > >> Although > >>>>>>>>>>>>> “table.select(udf1(), > >>>>>>>>>>>>>>>> udf2(), udf3()....)” can be used to > >>> accomplish > >>>>> the > >>>>>>> same > >>>>>>>>>>>>> function., > >>>>>>>>>>>>>> with a > >>>>>>>>>>>>>>>> map() function returning 100 columns, one > >> has > >>>> to > >>>>>>> define > >>>>>>>>> or > >>>>>>>>>>> call > >>>>>>>>>>>>>> 100 UDFs > >>>>>>>>>>>>>>>> when using SQL, which is quite involved. > >>>>>>>>>>>>>>>> - > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> FlatMap - e.g. > >>>> “dataStrem.flatmap(flatMapFun)”. 
> >>>>>>>>> Similarly, > >>>>>>>>>>> it > >>>>>>>>>>>>> can > >>>>>>>>>>>>>>>> be implemented with > >>>> “table.join(udtf).select()”. > >>>>>>>> However, > >>>>>>>>>> it > >>>>>>>>>>> is > >>>>>>>>>>>>>> obvious > >>>>>>>>>>>>>>>> that datastream is easier to use than SQL. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Due to the above two reasons, some users have to > >>> use > >>>>> the > >>>>>>>>>>> DataStream > >>>>>>>>>>>>> API > >>>>>>>>>>>>>> or > >>>>>>>>>>>>>>>> the DataSet API. But when they do that, they > >> lose > >>>> the > >>>>>>>>>> unification > >>>>>>>>>>> of > >>>>>>>>>>>>>> batch > >>>>>>>>>>>>>>>> and streaming. They will also lose the > >>> sophisticated > >>>>>>>>>> optimizations > >>>>>>>>>>>>> such > >>>>>>>>>>>>>> as > >>>>>>>>>>>>>>>> codegen, aggregate join transpose and > >> multi-stage > >>>> agg > >>>>>>> from > >>>>>>>>>> Flink > >>>>>>>>>>>> SQL. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> We believe that enhancing the functionality and > >>>>>>> productivity > >>>>>>>>> is > >>>>>>>>>>>> vital > >>>>>>>>>>>>>> for > >>>>>>>>>>>>>>>> the successful adoption of Table API. To this > >> end, > >>>>>> Table > >>>>>>>> API > >>>>>>>>>>> still > >>>>>>>>>>>>>>>> requires more efforts from every contributor in > >>> the > >>>>>>>> community. > >>>>>>>>>> We > >>>>>>>>>>>> see > >>>>>>>>>>>>>> great > >>>>>>>>>>>>>>>> opportunity in improving our user’s experience > >>> from > >>>>> this > >>>>>>>> work. > >>>>>>>>>> Any > >>>>>>>>>>>>>> feedback > >>>>>>>>>>>>>>>> is welcome. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Regards, > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Jincheng > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>> > >>>>> > >>>> > >>> > >> > > |
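To make Jark's Table.drop() example and Jincheng's eval(Row) idea from the quoted messages above concrete, here is a rough sketch of the difference. Note that this is hypothetical code: Table.drop() and row-based UDF application were only proposals at the time of this thread, and the column and UDF names are placeholders.

```java
// Dropping one column from a 100-column table today means enumerating the rest:
Table result = table.select("col0, col1 /* ..., */, col98");

// With the proposed convenience method, the intent is stated directly:
Table result2 = table.drop("col99");

// Likewise, a UDF whose eval method takes the whole Row could be applied to
// all columns at once instead of listing each one:
Table mapped = table.select("rowUdf(*)");
```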
Hi Piotr:
I want to clarify one thing first: I think we will keep the interoperability between the Table API and DataStream in any case, so users can switch between the two whenever needed. Given that, it would still be very helpful if users could use one API to achieve most of what they do. Currently, the Table API/SQL is good enough for most data-analytics scenarios, but there are some limitations which, when removed, will help us go even further in this direction. An initial list of these is provided in another thread. These are natural extensions to the Table API which we need to make just for the sake of making the Table API more usable.

The Table API and SQL share the same underlying implementation, so an enhancement to one ends up helping the other. I don't see them as competitive. The Table API is easier to extend because we have a bit more freedom in adding new functionalities. In practice, the Table API can be mixed with SQL as well.

On the implementation side, I agree that the Table API/SQL and DataStream should try to share as much as possible. But that question is orthogonal to the API discussion.

This thread is meant for enhancing the functionality of the Table API. I don't think anyone is suggesting reducing the effort in either SQL or DataStream, so let's focus on how we can enhance the Table API. Regards, Xiaowei |
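The interoperability Xiaowei refers to already exists along roughly the following lines. This is a sketch against the Flink 1.x Java API of that era; the exact factory method and field-expression syntax vary across versions.

```java
// Sketch: switching between DataStream and Table (Flink 1.6-era API).
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tEnv = TableEnvironment.getTableEnvironment(env);

DataStream<Tuple2<String, Long>> stream = env.fromElements(Tuple2.of("flink", 1L));

// DataStream -> Table
Table table = tEnv.fromDataStream(stream, "word, cnt");

// relational logic stays in the Table API
Table filtered = table.filter("cnt > 10");

// Table -> DataStream; the Boolean flag carries the retraction information
DataStream<Tuple2<Boolean, Row>> back = tEnv.toRetractStream(filtered, Row.class);
```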
Hi,
> This thread is meant to enhancing the functionalities of TableAPI. I don't > think that anyone is suggesting either reducing the effort in SQL or > DataStream. So let's focus on how we can enhance TableAPI. I wasn’t thinking about that. As I said before, I was raising a question: what should the Table API look like in the future if we want to diverge it more and more from SQL? It looks to me that the rough consensus is that it should be expanded and evolve in parallel to the DataStream API, but in order to better suit different needs, with the following differences: - declarative - predefined schema/types - no custom state operations (?) - optimisations allowed by the above points Piotrek > On 7 Nov 2018, at 16:01, Xiaowei Jiang <[hidden email]> wrote: > > Hi Piotr: > > I want to clarify one thing first: I think that we will keep the > interoperability between TableAPI and DataStream in any case. So user can > switch between the two whenever needed. Given that, it would still be very > helpful that users can use one API to achieve most of what they do. > Currently, TableAPI/SQL is good enough for most data analytics kind of > scenarios, but there are some limitations that when removed will help we go > even further in this direction. An initial list of these is provided in > another thread. These are naturally extensions to TableAPI which we need to > do just for the sake of making TableAPI more usable. > > TableAPI and SQL share the same underlying implementation, so enhancement > in one will end up helping the other. I don't see them as competitive. > TableAPI is easier to extend because that we have a bit more freedom in > adding new functionalities. In reality, TableAPI can be mixed with SQL as > well. > > On the implementation side, I agree that Table API/SQL and DataStream > should try to share as much as possible. But that question is orthogonal to > the API discussion. > > This thread is meant to enhancing the functionalities of TableAPI. 
I don't > think that anyone is suggesting either reducing the effort in SQL or > DataStream. So let's focus on how we can enhance TableAPI. > > Regards, > Xiaowei |
Yes, that is my understanding as well.
Manual time management would be another difference. Something still to be discussed would be whether (or to what extent) it would be possible to define the physical execution plan with hints or methods like partitionByHash and sortPartition. Best, Fabian Am Di., 13. Nov. 2018, 13:57 hat Piotr Nowojski <[hidden email]> geschrieben: > Hi, > > > This thread is meant to enhancing the functionalities of TableAPI. I > don't > > think that anyone is suggesting either reducing the effort in SQL or > > DataStream. So let's focus on how we can enhance TableAPI. > > I wasn’t thinking about that. As I said before, I was rising a question, > what Table API should look like in the future if we want to diverge it more > and more from SQL. It looks to me, that the more or less consensus is that > it should be expanded and evolve parallel to the DataStream API, but in > order to better suite different needs with following differences: > - declarative > - predefined schema/types > - no custom state operations (?) > - optimisations allowed by the above points > > Piotrek > > > On 7 Nov 2018, at 16:01, Xiaowei Jiang <[hidden email]> wrote: > > > > Hi Piotr: > > > > I want to clarify one thing first: I think that we will keep the > > interoperability between TableAPI and DataStream in any case. So user can > > switch between the two whenever needed. Given that, it would still be > very > > helpful that users can use one API to achieve most of what they do. > > Currently, TableAPI/SQL is good enough for most data analytics kind of > > scenarios, but there are some limitations that when removed will help we > go > > even further in this direction. An initial list of these is provided in > > another thread. These are naturally extensions to TableAPI which we need > to > > do just for the sake of making TableAPI more usable. > > > > TableAPI and SQL share the same underlying implementation, so enhancement > > in one will end up helping the other. I don't see them as competitive. 
> > TableAPI is easier to extend because that we have a bit more freedom in > > adding new functionalities. In reality, TableAPI can be mixed with SQL as > > well. > > > > On the implementation side, I agree that Table API/SQL and DataStream > > should try to share as much as possible. But that question is orthogonal > to > > the API discussion. > > > > This thread is meant to enhancing the functionalities of TableAPI. I > don't > > think that anyone is suggesting either reducing the effort in SQL or > > DataStream. So let's focus on how we can enhance TableAPI. > > > > Regards, > > Xiaowei > > |
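The physical-plan methods Fabian mentions exist today on the DataSet API. A small sketch (assuming `input` is an existing `DataSet<Tuple2<String, Long>>`):

```java
// Explicit physical hints on the DataSet API:
DataSet<Tuple2<String, Long>> result =
    input.partitionByHash(0)                  // pick the partitioning strategy
         .sortPartition(1, Order.ASCENDING);  // sort within each partition
```

The open question raised above is whether the Table API should expose anything comparable, or leave such decisions entirely to the optimizer.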
Hi Piotrek, Fabian:
I am very glad to see your replies. Thank you very much, Piotrek, for asking very good questions. I will share my opinion:

- The Table API enhancements I proposed aim at user friendliness. After the enhancement, the API will keep the characteristics of Table API & SQL, such as being declarative and optimizable.
- For the differences between the DataStream API and the Table API, I see two points:
  - a) State management: DataStream API users can use the state API to perform state operations directly; Table API/SQL users can only work with state indirectly, e.g. through DataView.
  - b) Physical operations: the DataStream API can call physical operations such as rebalance()/rescale()/shuffle(), while the Table API lets the optimizer choose the underlying strategy. If necessary, we can follow the database hint mechanism and add hints to the Table API to influence physical operations, but I don't recommend adding operations such as rebalance()/rescale()/shuffle()/sortPartition() to the Table API itself.
- About the management of time attributes, we can continue the discussion in the Table API Enhancement Outline: https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB .
- Regarding the proposed enhanced Table API, my core goal is to solve the usability problem. At this stage, the following four APIs will be proposed:
  - Table.map
  - Table.flatMap
  - GroupedTable.agg
  - GroupedTable.flatAgg

The above APIs are added for ease of use. Taking map as an example: with the DataStream API one writes "dataStream.map(mapFun)". Although "table.select(udf1(), udf2(), udf3()...)" can accomplish the same thing, with a map() function returning 100 columns one has to define or call 100 UDFs when using SQL, which is quite involved.

I totally agree that we have to discuss the API changes in depth and let our community's APIs continue to develop in the right direction. 
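A sketch of how the four proposed methods might look to users. These signatures are hypothetical at this point (the final design is discussed in the linked outline), and MyMapFunction etc. are placeholder UDF classes.

```java
Table mapped   = table.map(new MyMapFunction());           // one row in, one row out
Table expanded = table.flatMap(new MyFlatMapFunction());   // one row in, zero or more rows out
Table aggOut   = table.groupBy("user")
                      .agg(new MyAggregateFunction());     // one output row per group
Table flatAgg  = table.groupBy("user")
                      .flatAgg(new MyTableAggFunction());  // e.g. top-N: several rows per group
```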
Thanks again for the reply, and looking forward to your feedback!:) Best, Jincheng Fabian Hueske <[hidden email]> 于2018年11月13日周二 下午9:31写道: > Yes, that is my understanding as well. > > Manual time management would be another difference. > Something still to be discussed would be whether (or to what extent) it > would be possible to define the physical execution plan with hints or > methods like partitionByHash and sortPartition. > > Best, Fabian > > Am Di., 13. Nov. 2018, 13:57 hat Piotr Nowojski <[hidden email]> > geschrieben: > > > Hi, > > > > > This thread is meant to enhancing the functionalities of TableAPI. I > > don't > > > think that anyone is suggesting either reducing the effort in SQL or > > > DataStream. So let's focus on how we can enhance TableAPI. > > > > I wasn’t thinking about that. As I said before, I was rising a question, > > what Table API should look like in the future if we want to diverge it > more > > and more from SQL. It looks to me, that the more or less consensus is > that > > it should be expanded and evolve parallel to the DataStream API, but in > > order to better suite different needs with following differences: > > - declarative > > - predefined schema/types > > - no custom state operations (?) > > - optimisations allowed by the above points > > > > Piotrek > > > > > On 7 Nov 2018, at 16:01, Xiaowei Jiang <[hidden email]> wrote: > > > > > > Hi Piotr: > > > > > > I want to clarify one thing first: I think that we will keep the > > > interoperability between TableAPI and DataStream in any case. So user > can > > > switch between the two whenever needed. Given that, it would still be > > very > > > helpful that users can use one API to achieve most of what they do. > > > Currently, TableAPI/SQL is good enough for most data analytics kind of > > > scenarios, but there are some limitations that when removed will help > we > > go > > > even further in this direction. An initial list of these is provided in > > > another thread. 
These are naturally extensions to TableAPI which we > need > > to > > > do just for the sake of making TableAPI more usable. > > > > > > TableAPI and SQL share the same underlying implementation, so > enhancement > > > in one will end up helping the other. I don't see them as competitive. > > > TableAPI is easier to extend because that we have a bit more freedom in > > > adding new functionalities. In reality, TableAPI can be mixed with SQL > as > > > well. > > > > > > On the implementation side, I agree that Table API/SQL and DataStream > > > should try to share as much as possible. But that question is > orthogonal > > to > > > the API discussion. > > > > > > This thread is meant to enhancing the functionalities of TableAPI. I > > don't > > > think that anyone is suggesting either reducing the effort in SQL or > > > DataStream. So let's focus on how we can enhance TableAPI. > > > > > > Regards, > > > Xiaowei > > > > > |
Thanks Jincheng,
That makes sense to me. Another differentiation of Table API and DataStream API would be the access to the timer service. The DataStream API can register and act on timers while the Table API would not have this feature. Best, Fabian Am Mi., 14. Nov. 2018 um 02:02 Uhr schrieb jincheng sun < [hidden email]>: > Hi Piotrek, Fabian: > > I am very glad to see your reply. Thank you very much Piotrek for asking > very good questions. I will share my opinion: > > > - The Enhancing TableAPI that I proposed is proposed for user > friendliness. After enhancement, it will maintain the characteristics of > TableAPI&SQL, such as: declarative, optimization etc. > > > - For the difference between DataStreamAPI and TableAPI, I think there > are two points: > - a). State management, DataStreamAPI users can use stateAPI to perform > state operations directly. TableAPI/SQL can only use such as: > DataView > indirectly. > - b). Physical operation, DataStreamAPI can call > rebalance()/rescale()/shuffle()etc. physical operation, TableAPI > will use > the optimizer to judge the underlying strategy. If necessary, we will > follow the database's hint mechanism and add a hint to the tableAPI. > Affects physical operations, but I don't recommend adding operations > such > as rebalance()/rescale()/shuffle()/sortPartition()etc. on the > TableAPI. > - About the management of time attributes we can continue to discuss in > TableAPI Enhancement Outline: > > https://mail.google.com/mail/u/0/#search/xiaowei/FMfcgxvzLWzfvCnmvMzzSfxHTSfdwLkB > . > > > - Regarding the proposed Enhanced TableAPI, my core goal is to solve the > usability problem. At this stage, the following four APIs will be > proposed: > - Table.Map > - Table.FlatMap > - GroupedTable.agg > - GroupedTable.flatAgg > > Adding the above API is for the terms of ease of use, > taking map as an example: Map - e.g. “dataStream.map(mapFun)”. 
Although > “table.select(udf1(), udf2(), udf3()....)” can be used to accomplish the > same function., with a map() function returning 100 columns, one has to > define or call 100 UDFs when using SQL, which is quite involved. > > I totally agree that we have to discuss in depth the changes in the API and > let our community APIs continue to develop in the right direction. Thanks > again for the reply, and looking forward to your feedback!:) > > Best, > Jincheng > > Fabian Hueske <[hidden email]> 于2018年11月13日周二 下午9:31写道: > > > Yes, that is my understanding as well. > > > > Manual time management would be another difference. > > Something still to be discussed would be whether (or to what extent) it > > would be possible to define the physical execution plan with hints or > > methods like partitionByHash and sortPartition. > > > > Best, Fabian > > > > Am Di., 13. Nov. 2018, 13:57 hat Piotr Nowojski <[hidden email] > > > > geschrieben: > > > > > Hi, > > > > > > > This thread is meant to enhancing the functionalities of TableAPI. I > > > don't > > > > think that anyone is suggesting either reducing the effort in SQL or > > > > DataStream. So let's focus on how we can enhance TableAPI. > > > > > > I wasn’t thinking about that. As I said before, I was rising a > question, > > > what Table API should look like in the future if we want to diverge it > > more > > > and more from SQL. It looks to me, that the more or less consensus is > > that > > > it should be expanded and evolve parallel to the DataStream API, but in > > > order to better suite different needs with following differences: > > > - declarative > > > - predefined schema/types > > > - no custom state operations (?) 
> > > - optimisations allowed by the above points
> > >
> > > Piotrek
> > >
> > > > On 7 Nov 2018, at 16:01, Xiaowei Jiang <[hidden email]> wrote:
> > > >
> > > > Hi Piotr:
> > > >
> > > > I want to clarify one thing first: I think that we will keep the interoperability between the Table API and DataStream in any case, so users can switch between the two whenever needed. Given that, it would still be very helpful if users can use one API to achieve most of what they do. Currently, the Table API/SQL is good enough for most data-analytics scenarios, but there are some limitations that, when removed, will help us go even further in this direction. An initial list of these is provided in another thread. These are natural extensions to the Table API which we need to do just for the sake of making the Table API more usable.
> > > >
> > > > The Table API and SQL share the same underlying implementation, so an enhancement in one ends up helping the other. I don't see them as competitive. The Table API is easier to extend because we have a bit more freedom in adding new functionalities. In reality, the Table API can be mixed with SQL as well.
> > > >
> > > > On the implementation side, I agree that the Table API/SQL and DataStream should try to share as much as possible. But that question is orthogonal to the API discussion.
> > > >
> > > > This thread is meant to enhance the functionalities of the Table API. I don't think that anyone is suggesting either reducing the effort in SQL or DataStream. So let's focus on how we can enhance the Table API.
> > > >
> > > > Regards,
> > > > Xiaowei
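To make the ease-of-use argument for the proposed Table.map concrete, here is a minimal Python sketch. It is not Flink code: `table_map` and `table_select` are invented stand-ins that only model the ergonomics of the two styles (one row-wise UDF versus one scalar UDF per output column).

```python
# Hypothetical mini-model (not the Flink Table API): contrasts a row-wise
# map() UDF with per-column scalar UDFs in a select()-style call.

def table_map(rows, udf):
    """map(): a single UDF produces the whole output row at once."""
    return [udf(r) for r in rows]

def table_select(rows, *udfs):
    """select(): one scalar UDF is required per output column."""
    return [tuple(udf(r) for udf in udfs) for r in rows]

rows = [(1, 2), (3, 4)]

# One row-wise UDF emitting three derived columns at once...
via_map = table_map(rows, lambda r: (r[0] + r[1], r[0] * r[1], r[0]))

# ...versus three separate scalar UDFs for the same output schema.
# With 100 output columns this list would need 100 UDF entries.
via_select = table_select(
    rows,
    lambda r: r[0] + r[1],
    lambda r: r[0] * r[1],
    lambda r: r[0],
)

print(via_map == via_select)  # True: both yield [(3, 2, 1), (7, 12, 3)]
```

The number of UDF definitions grows with the output width in the select() style, which is the "quite involved" cost described above.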
Hi Fabian,
Yes, timers are not only a difference between the Table API and DataStream, but also a difference between DataStream and DataSet. We need to unify batch and stream in the Table API, so the difference around timers needs to be considered in depth. :)

Thanks,
Jincheng

Fabian Hueske <[hidden email]> 于2018年11月15日周四 下午8:58写道:

> Thanks Jincheng,
>
> That makes sense to me. Another differentiation of the Table API and DataStream API would be the access to the timer service: the DataStream API can register and act on timers, while the Table API would not have this feature.
>
> Best, Fabian
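The agg/flatAgg split in the proposed GroupedTable API discussed in this thread can likewise be illustrated with a hypothetical sketch (again not Flink code; `group_by`, `agg`, and `flat_agg` are invented for illustration): agg collapses each group to exactly one result, while flatAgg may emit several results per group, e.g. a top-N per group that is awkward to express with plain SQL aggregates.

```python
# Hypothetical mini-model (not the Flink Table API): agg vs. flatAgg.
from collections import defaultdict

def group_by(rows, key_fn):
    """Bucket rows by key, like table.groupBy(...)."""
    groups = defaultdict(list)
    for r in rows:
        groups[key_fn(r)].append(r)
    return groups

def agg(groups, agg_fn):
    """agg: each group collapses to exactly one result."""
    return {k: agg_fn(g) for k, g in groups.items()}

def flat_agg(groups, flat_agg_fn):
    """flatAgg: each group may emit any number of results (e.g. top-N)."""
    return {k: flat_agg_fn(g) for k, g in groups.items()}

rows = [("a", 1), ("a", 5), ("a", 3), ("b", 2)]
groups = group_by(rows, lambda r: r[0])

# Classic aggregate: one value per group.
totals = agg(groups, lambda g: sum(v for _, v in g))
print(totals)  # {'a': 9, 'b': 2}

# "Top 2 values per group": several output values per group.
top2 = flat_agg(groups, lambda g: sorted((v for _, v in g), reverse=True)[:2])
print(top2)  # {'a': [5, 3], 'b': [2]}
```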