Hi all,
I'd like to start a discussion about introducing a few convenient operations in the Table API from the perspective of ease of use.

Currently some tasks are not easy to express in the Table API, e.g. deduplication, topn, etc., or are not easy to express when there are hundreds of columns in a table, e.g. null data handling.

I'd like to propose introducing a few operations in the Table API with the following purposes:
- Allow Table API users to easily leverage the powerful features already available in SQL, e.g. deduplication, topn, etc.
- Provide some convenient operations, e.g. a series of operations for null data handling (which may become a problem when there are hundreds of columns), and for data sampling and splitting (a very common use case in ML, which usually needs to split a table into multiple tables for training and validation separately).

Please refer to FLIP-155 [1] for more details.

Looking forward to your feedback!

Regards,
Dian

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-155%3A+Introduce+a+few+convenient+operations+in+Table+API
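
As a rough illustration of the splitting use case mentioned above (one table split into a training part and a validation part), here is a minimal sketch in plain Java. It does not use the Flink Table API, and the class name `RandomSplitSketch` and the method `split` are hypothetical names chosen for this example, not part of FLIP-155.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not a Flink API: randomly splits rows into two parts.
class RandomSplitSketch {

    /**
     * Assigns each row to the training part with probability trainFraction,
     * otherwise to the validation part. A fixed seed makes the split repeatable.
     */
    public static <T> List<List<T>> split(List<T> rows, double trainFraction, long seed) {
        Random random = new Random(seed);
        List<T> train = new ArrayList<>();
        List<T> validation = new ArrayList<>();
        for (T row : rows) {
            if (random.nextDouble() < trainFraction) {
                train.add(row);
            } else {
                validation.add(row);
            }
        }
        List<List<T>> parts = new ArrayList<>();
        parts.add(train);
        parts.add(validation);
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            rows.add(i);
        }
        List<List<Integer>> parts = split(rows, 0.8, 42L);
        System.out.println("training rows:   " + parts.get(0).size());
        System.out.println("validation rows: " + parts.get(1).size());
    }
}
```

Every row lands in exactly one part, so the two sizes always add up to the input size. The actual ratio only approximates `trainFraction`.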
Hi Dian,
Big +1 for making the Table API easier to use. Java users and Python users can both benefit from it. I think it would be better if we add some Python API examples.

Best,
Wei
This makes sense, I have some questions about method names.
What do you think about renaming `dropDuplicates` to `deduplicate`? I don't think that "drop" is the right word for this operation: it implies records are filtered, whereas this operator actually issues updates and retractions. Also, "deduplicate" is already how we talk about this feature in the docs, so I think it would be easier for users to find.

For null handling, I don't know how closely we want to stick to SQL conventions, but what about making `coalesce` a top-level method? Something like:

myTable.coalesce($("a"), 1).as("a")

We can require the next method to be an `as`. There is already precedent for this sort of thing: `GroupedTable#aggregate` can only be followed by `select`.

Seth
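
To make the point about updates and retractions concrete, here is a minimal sketch in plain Java of a keep-last-row deduplication that records the changelog it produces. It is a simplified model and not Flink's implementation. The class and method names are hypothetical, and the `+I`/`-U`/`+U` prefixes follow the short names Flink uses for insert, update-before and update-after rows.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model, not Flink code: deduplication that keeps the last row per key.
class DedupChangelogSketch {

    /** Each input row is {key, value}; returns the changelog entries emitted downstream. */
    public static List<String> keepLastRow(List<String[]> rows) {
        Map<String, String> latest = new HashMap<>();
        List<String> changelog = new ArrayList<>();
        for (String[] row : rows) {
            String key = row[0];
            String value = row[1];
            String previous = latest.put(key, value);
            if (previous == null) {
                // First row for this key: a plain insert.
                changelog.add("+I(" + key + "," + value + ")");
            } else {
                // A newer row replaces the old one: retract the old, emit the new.
                changelog.add("-U(" + key + "," + previous + ")");
                changelog.add("+U(" + key + "," + value + ")");
            }
        }
        return changelog;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"a", "1"});
        rows.add(new String[] {"b", "2"});
        rows.add(new String[] {"a", "3"});
        for (String change : keepLastRow(rows)) {
            System.out.println(change);
        }
    }
}
```

For the three input rows in `main`, the second row for key `a` is not silently dropped. It retracts the earlier result and replaces it, which is why "drop" can be a misleading name.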
Hi Dian,
thanks for the proposed FLIP. I haven't taken a deep look at the proposal yet but will do so shortly. In general, we should aim to make the Table API as concise and self-explanatory as possible. E.g. `dropna` does not sound obvious to me.

Regarding `myTable.coalesce($("a"), 1).as("a")`: Instead of introducing more top-level functions, maybe we should also consider introducing more building blocks, e.g. for applying an expression to every column. A more functional approach (e.g. with a lambda function) could solve more use cases.

Regards,
Timo
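
The idea of a building block that applies one expression to every column can be sketched in plain Java as follows. This only models a single row as a map from column name to value. The class name `ApplyToColumnsSketch` and the method `applyToColumns` are hypothetical and do not exist in the Table API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

// Hypothetical model, not a Flink API: one function applied across many columns.
class ApplyToColumnsSketch {

    /**
     * Applies fn to the value of every selected column and leaves the others untouched.
     * Passing null for columns selects all columns.
     */
    public static Map<String, Object> applyToColumns(
            Map<String, Object> row, UnaryOperator<Object> fn, Set<String> columns) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : row.entrySet()) {
            boolean selected = columns == null || columns.contains(entry.getKey());
            Object value = entry.getValue();
            result.put(entry.getKey(), selected ? fn.apply(value) : value);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("a", null);
        row.put("b", 2);
        // Null replacement is just one possible function; any per-column logic fits.
        UnaryOperator<Object> nullToOne = v -> v == null ? Integer.valueOf(1) : v;
        System.out.println(applyToColumns(row, nullToOne, null));
    }
}
```

With such a building block, null replacement becomes one use of a general mechanism rather than a dedicated top-level method.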
Thanks Dian,
+1 to `deduplicate`.

Regarding `myTable.coalesce($("a"), 1).as("a")`, I'm afraid it may conflict with, or be confused with, the built-in expression `coalesce(f0, 0)` (which we may introduce in the future).

Besides that, could we also align with other features of Flink SQL, e.g. event-time/processing-time temporal join, SQL Hints, window TVF (FLIP-145 [1])?

Best,
Jark

[1]: https://cwiki.apache.org/confluence/display/FLINK/FLIP-145%3A+Support+SQL+windowing+table-valued+function
Thanks a lot for your comments!
Regarding Python Table API examples: I thought it would be straightforward to use these operations in the Python Table API and so had not added them. However, the suggestion makes sense to me and I have added some examples of how to use them in the Python Table API to make it clearer.

Regarding dropDuplicates vs deduplicate: +1 to use deduplicate. It's more consistent with the feature/concept which is already documented clearly in Flink.

Regarding `myTable.coalesce($("a"), 1).as("a")`: I'm still in favor of fillna for now. Compared to coalesce, fillna can handle multiple columns in one method call. As for the naming convention, the names "fillna/dropna/replace" come from Pandas [1][2][3].

Regarding `event-time/processing-time temporal join, SQL Hints, window TVF`: Good catch! We should definitely support them in the Table API. I will update the FLIP to cover these functionalities.

[1] https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html
[2] https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html
[3] https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html
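
The argument that a fillna-style operation can cover several columns in one call can be illustrated with a minimal plain-Java sketch. It models one row as a map and is not the FLIP-155 API. The class `FillNullSketch` and the method `fillNulls` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model, not a Flink API: fill nulls in several columns in one call.
class FillNullSketch {

    /**
     * Returns a copy of row in which every null value of a column listed in
     * defaults is replaced by that column's default. Other columns are unchanged.
     */
    public static Map<String, Object> fillNulls(
            Map<String, Object> row, Map<String, Object> defaults) {
        Map<String, Object> result = new LinkedHashMap<>(row);
        for (Map.Entry<String, Object> entry : defaults.entrySet()) {
            String column = entry.getKey();
            if (result.containsKey(column) && result.get(column) == null) {
                result.put(column, entry.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("a", null);
        row.put("b", "x");
        row.put("c", null);

        Map<String, Object> defaults = new LinkedHashMap<>();
        defaults.put("a", 0);
        defaults.put("c", "n/a");

        System.out.println(fillNulls(row, defaults));
    }
}
```

A per-column `coalesce` would need one expression per column to achieve the same result, which is the inconvenience described above for tables with hundreds of columns.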
Hi all,
I have updated the FLIP to cover temporal join, SQL hints and window TVF.

Regards,
Dian
Hi Dian,
Thanks for working on improving the Table API. I went through the entire FLIP and many functions definitely make sense. However, we need to make sure that the general API naming, behavior etc. remain consistent. Here is some feedback from my side:

1) deduplicate

Are we planning to overload this method in Java or do users always have to provide all 3 parameters? I'm asking because I find the `new Expression[] {$("a"), $("b")}` not very fluent. It would be better to have varargs at the end of the method signature instead. If this is not possible, maybe we could also think about forcing `withColumns()`/`withoutColumns()` programmatically at those locations instead of using arrays. Maybe we can introduce an `ExpressionList` that is returned by `withColumns`/`withoutColumns`.

Are users able to define `asc` or `desc` for the `orderField`?

2) topn

Rename to just `top` such that it reads `top(3)`?

Can't we use existing parts of the API for this task, and introduce a `partitionBy` clause?

`Table.partitionBy(...).orderBy(...).limit(3)`

Actually, we could use a similar syntax for deduplicate as well:

`Table.partitionBy(...).orderBy(...).deduplicate()`

3) hint

How can we guarantee the same API for Scala and Java? `java.util.Map<String, String>` would require Scala users to perform collection transformations. Can we introduce a fluent way to unify the two APIs? For example, add a dedicated method for all kinds of hints?

```
table
  .hintOption(String key, String value)
  .hintOption(String key, String value)
  .hintOption(String key, String value)
```

4) fillna

I don't find this name intuitive, and it doesn't match the other methods of the API. How about `replaceNull()`?

In general, I'm wondering if we should rather introduce a lambda-like function that would serve a variety of use cases. Just an initial example:

```
table.mapColumns(e -> e.ifNull(1))
table.mapColumns(e -> e.ifNull(1), ExpressionList)
```

5) dropna

Is this really useful? This sounds like a rarely used method.

6) replace

Similar to other proposed methods, we will have issues with the Scala API when using a `java.util.Map`. Furthermore, this map also takes expressions instead of objects.

Let me know what you think.

Regards,
Timo
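
The `partitionBy(...).orderBy(...).limit(3)` reading of Top-N suggested in point 2 can be modelled in plain Java as three steps: group rows by a key, sort each group, and keep the first n rows. This is a batch-style sketch under that assumption, not Flink code. The class `TopNPerPartitionSketch` and the method `topN` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical model, not a Flink API: Top-N as partition, order, limit.
class TopNPerPartitionSketch {

    public static <T, K> Map<K, List<T>> topN(
            List<T> rows, Function<T, K> partitionKey, Comparator<T> order, int n) {
        // Step 1 (partitionBy): group rows by their partition key.
        Map<K, List<T>> partitions = new LinkedHashMap<>();
        for (T row : rows) {
            partitions.computeIfAbsent(partitionKey.apply(row), k -> new ArrayList<>()).add(row);
        }
        // Steps 2 and 3 (orderBy, limit): sort each partition and keep the first n rows.
        for (Map.Entry<K, List<T>> entry : partitions.entrySet()) {
            List<T> sorted = new ArrayList<>(entry.getValue());
            sorted.sort(order);
            entry.setValue(new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size()))));
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<Integer> rows = Arrays.asList(5, 2, 9, 4, 7, 8);
        // Partition by parity, order descending, keep the top 2 of each partition.
        Map<Integer, List<Integer>> top =
                topN(rows, (Integer x) -> x % 2, Comparator.<Integer>reverseOrder(), 2);
        System.out.println(top);
    }
}
```

Replacing the limit step with "keep only the first row" gives the deduplicate variant mentioned in the same point, which is why the two operations could share one syntax.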