[DISCUSS] FLIP-155: Introduce a few convenient operations in Table API

Dian Fu-2
Hi all,

I'd like to start a discussion about introducing a few convenient operations in Table API from the perspective of ease of use.

Currently, some tasks are not easy to express in the Table API, e.g. deduplication and top-N, and others become hard to express when a table has hundreds of columns, e.g. null data handling.

I'd like to propose introducing a few operations in the Table API with the following purposes:
- Let Table API users easily leverage the powerful features already available in SQL, e.g. deduplication and top-N.
- Provide some convenience operations, e.g. a series of operations for null data handling (which becomes a problem when there are hundreds of columns) and for data sampling and splitting (a very common use case in ML, which usually needs to split a table into multiple tables for training and validation).

Please refer to FLIP-155 [1] for more details.

Looking forward to your feedback!

Regards,
Dian

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-155%3A+Introduce+a+few+convenient+operations+in+Table+API

Re: [DISCUSS] FLIP-155: Introduce a few convenient operations in Table API

Wei Zhong-2
Hi Dian,

Big +1 for making the Table API easier to use. Java users and Python users can both benefit from it. I think it would be better if we add some Python API examples.

Best,
Wei


> On Jan 4, 2021, at 20:03, Dian Fu <[hidden email]> wrote:
>
> Hi all,
>
> I'd like to start a discussion about introducing a few convenient operations in Table API from the perspective of ease of use.
>
> Currently some tasks are not easy to express in Table API e.g. deduplication, topn, etc, or not easy to express when there are hundreds of columns in a table, e.g. null data handling, etc.
>
> I'd like to propose to introduce a few operations in Table API with the following purposes:
> - Make Table API users to easily leverage the powerful features already in SQL, e.g. deduplication, topn, etc
> - Provide some convenient operations, e.g. introducing a series of operations for null data handling (it may become a problem when there are hundreds of columns), data sampling and splitting (which is a very common use case in ML which usually needs to split a table into multiple tables for training and validation separately).
>
> Please refer to FLIP-155 [1] for more details.
>
> Looking forward to your feedback!
>
> Regards,
> Dian
>
> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-155%3A+Introduce+a+few+convenient+operations+in+Table+API


Re: [DISCUSS] FLIP-155: Introduce a few convenient operations in Table API

Seth Wiesman-4
This makes sense. I have some questions about method names.

What do you think about renaming `dropDuplicates` to `deduplicate`? I don't
think that drop is the right word for this operation: it implies records
are filtered, whereas this operator actually issues updates and
retractions. Also, deduplicate is already how we talk about this feature in
the docs, so I think it would be easier for users to find.
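
To make the filter-versus-retraction distinction concrete, here is a minimal plain-Java sketch (not Flink code) of a keep-last-row deduplication over (key, value) events. The class name and the `+I`/`-U`/`+U` string tags are illustrative assumptions, loosely modeled on changelog row kinds; the point is only that a later row for an already-seen key produces a retraction plus an update instead of being silently dropped.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; none of these names are Flink API.
final class DedupChangelogSketch {

    // Keep-last-row deduplication: remembers the latest value per key and
    // emits a retraction ("-U") plus an update ("+U") when a key reappears.
    static List<String> deduplicateKeepLast(List<Map.Entry<String, Integer>> events) {
        Map<String, Integer> latest = new HashMap<>();
        List<String> changes = new ArrayList<>();
        for (Map.Entry<String, Integer> e : events) {
            Integer previous = latest.put(e.getKey(), e.getValue());
            if (previous == null) {
                // First row for this key: a plain insert.
                changes.add("+I(" + e.getKey() + "," + e.getValue() + ")");
            } else {
                // Key seen before: retract the old row, then emit the new one.
                changes.add("-U(" + e.getKey() + "," + previous + ")");
                changes.add("+U(" + e.getKey() + "," + e.getValue() + ")");
            }
        }
        return changes;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> events =
                List.of(Map.entry("a", 1), Map.entry("b", 2), Map.entry("a", 3));
        deduplicateKeepLast(events).forEach(System.out::println);
    }
}
```

A pure filter over the same input would emit at most one record per input row; the sketch emits two records for the second `a` row.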

For null handling, I don't know how closely we want to stick to SQL
conventions, but what about making `coalesce` a top-level method? Something
like:

myTable.coalesce($("a"), 1).as("a")

We can require the next method to be an `as`. There is already precedent
for this sort of thing, `GroupedTable#aggregate` can only be followed by
`select`.
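
One way such a "must be followed by `as`" rule could be enforced is through return types. The following is a minimal sketch under that assumption; `MiniTable` and `PendingColumn` are hypothetical stand-ins that hold a single row, not Flink classes, and the sketch says nothing about how `GroupedTable#aggregate` is actually implemented.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical types for illustration; not the Flink Table API.
final class StagedCoalesceSketch {

    // A one-row "table": column name -> value (values may be null).
    static final class MiniTable {
        final Map<String, Object> row;

        MiniTable(Map<String, Object> row) {
            this.row = row;
        }

        // Returns an intermediate type rather than MiniTable, so the only
        // call that compiles next is as(...).
        PendingColumn coalesce(String column, Object fallback) {
            Object value = row.get(column);
            return new PendingColumn(this, value != null ? value : fallback);
        }
    }

    // Intermediate stage: deliberately exposes nothing but as(...).
    static final class PendingColumn {
        private final MiniTable source;
        private final Object value;

        PendingColumn(MiniTable source, Object value) {
            this.source = source;
            this.value = value;
        }

        // Writes the coalesced value under the given name and returns a new table.
        MiniTable as(String name) {
            Map<String, Object> copy = new LinkedHashMap<>(source.row);
            copy.put(name, value);
            return new MiniTable(copy);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("a", null);
        row.put("b", "x");
        MiniTable result = new MiniTable(row).coalesce("a", 1).as("a");
        System.out.println(result.row);
    }
}
```

With this shape, calling any table method directly on the result of `coalesce(...)` is a compile error, because `PendingColumn` has no such methods.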

Seth

On Mon, Jan 4, 2021 at 6:27 AM Wei Zhong <[hidden email]> wrote:

> Hi Dian,
>
> Big +1 for making the Table API easier to use. Java users and Python users
> can both benefit from it. I think it would be better if we add some Python
> API examples.
>
> Best,
> Wei

Re: [DISCUSS] FLIP-155: Introduce a few convenient operations in Table API

Timo Walther-2
Hi Dian,

thanks for the proposed FLIP. I haven't taken a deep look at the
proposal yet but will do so shortly. In general, we should aim to make
the Table API as concise and self-explanatory as possible. E.g. `dropna`
does not sound obvious to me.

Regarding `myTable.coalesce($("a"), 1).as("a")`: Instead of introducing
more top-level functions, maybe we should also consider introducing more
building blocks, e.g. for applying an expression to every column. A more
functional approach (e.g. with a lambda function) could solve more use cases.

Regards,
Timo

On 04.01.21 15:35, Seth Wiesman wrote:

> This makes sense, I have some questions about method names.
>
> What do you think about renaming `dropDuplicates` to `deduplicate`? I don't
> think that drop is the right word to use for this operation, it implies
> records are filtered where this operator actually issues updates and
> retractions. Also, deduplicate is already how we talk about this feature in
> the docs so I think it would be easier for users to find.
>
> For null handling, I don't know how close we want to stick with SQL
> conventions but what about making `coalesce` a top-level method? Something
> like:
>
> myTable.coalesce($("a"), 1).as("a")
>
> We can require the next method to be an `as`. There is already precedent
> for this sort of thing, `GroupedTable#aggregate` can only be followed by
> `select`.
>
> Seth

Re: [DISCUSS] FLIP-155: Introduce a few convenient operations in Table API

Jark Wu-2
Thanks Dian,

+1 to `deduplicate`.

Regarding `myTable.coalesce($("a"), 1).as("a")`, I'm afraid it may
conflict with, or be confused with, the built-in expression
`coalesce(f0, 0)` (which we may introduce in the future).

Besides that, could we also align with other features of Flink SQL, e.g.
event-time/processing-time temporal join, SQL Hints, window TVF (FLIP-145
[1])?

Best,
Jark

[1]:
https://cwiki.apache.org/confluence/display/FLINK/FLIP-145%3A+Support+SQL+windowing+table-valued+function





On Mon, 4 Jan 2021 at 22:59, Timo Walther <[hidden email]> wrote:

> Hi Dian,
>
> thanks for the proposed FLIP. I haven't taken a deep look at the
> proposal yet but will do so shortly. In general, we should aim to make
> the Table API as concise and self-explaining as possible. E.g. `dropna`
> does not sound obvious to me.
>
> Regarding `myTable.coalesce($("a"), 1).as("a")`: Instead of introducing
> more top-level functions, maybe we should also consider introducing more
> building blocks e.g. for applying an expression to every column. A more
> functional approach (e.g. with a lambda function) could solve more use cases.
>
> Regards,
> Timo

Re: [DISCUSS] FLIP-155: Introduce a few convenient operations in Table API

Dian Fu-2
In reply to this post by Timo Walther-2
Thanks a lot for your comments!

Regarding Python Table API examples: I thought it would be straightforward to use these operations in the Python Table API, so I had not added any. However, the suggestion makes sense to me, and I have now added some Python Table API examples to make it clearer.

Regarding dropDuplicates vs deduplicate: +1 to use deduplicate. It's more consistent with the feature/concept, which is already clearly documented in Flink.

Regarding `myTable.coalesce($("a"), 1).as("a")`: I'm still in favor of fillna for now. Compared to coalesce, fillna can handle multiple columns in one method call. As for the naming convention, the names "fillna/dropna/replace" come from Pandas [1][2][3].
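
The "multiple columns in one method call" point can be illustrated with a minimal plain-Java sketch. `fillNulls` is a hypothetical helper operating on rows represented as maps; it is not the signature proposed in the FLIP and not Flink API. It takes a column-to-default map, in the spirit of the dict form of Pandas `fillna`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration; fillNulls is a hypothetical helper, not the FLIP's signature.
final class FillNullSketch {

    // Replaces null values with per-column defaults in a single pass.
    // Columns without an entry in `defaults` are left untouched.
    static List<Map<String, Object>> fillNulls(
            List<Map<String, Object>> rows, Map<String, Object> defaults) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            // Copy the row so the input is not mutated.
            Map<String, Object> filled = new LinkedHashMap<>(row);
            for (Map.Entry<String, Object> d : defaults.entrySet()) {
                if (filled.containsKey(d.getKey()) && filled.get(d.getKey()) == null) {
                    filled.put(d.getKey(), d.getValue());
                }
            }
            out.add(filled);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("a", null);
        row.put("b", null);
        row.put("c", null);
        // One call covers both "a" and "b"; "c" has no default and stays null.
        System.out.println(fillNulls(List.of(row), Map.of("a", 0, "b", "n/a")));
    }
}
```

Expressing the same thing with a per-column `coalesce` would need one call (and one `as`) per column, which is the cost being weighed when a table has hundreds of columns.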

Regarding `event-time/processing-time temporal join, SQL Hints, window TVF`: Good catch! We should definitely support them in the Table API. I will update the FLIP to cover these features.

[1] https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html
[2] https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html
[3] https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html

> On Jan 4, 2021, at 10:59 PM, Timo Walther <[hidden email]> wrote:
>
> Hi Dian,
>
> thanks for the proposed FLIP. I haven't taken a deep look at the proposal yet but will do so shortly. In general, we should aim to make the Table API as concise and self-explaining as possible. E.g. `dropna` does not sound obvious to me.
>
> Regarding `myTable.coalesce($("a"), 1).as("a")`: Instead of introducing more top-level functions, maybe we should also consider introducing more building blocks e.g. for applying an expression to every column. A more functional approach (e.g. with a lambda function) could solve more use cases.
>
> Regards,
> Timo

Re: [DISCUSS] FLIP-155: Introduce a few convenient operations in Table API

Dian Fu-2
Hi all,

I have updated the FLIP to cover temporal joins, SQL hints and window TVFs.

Regards,
Dian

> On Jan 5, 2021, at 11:58 AM, Dian Fu <[hidden email]> wrote:
>
> Thanks a lot for your comments!
>
> Regarding to Python Table API examples: I thought it should be straightforward about how to use these operations in Python Table API and so have not added them. However, the suggestions make sense to me and I have added some examples about how to use them in Python Table API to make it more clear.
>
> Regarding to dropDuplicates vs deduplicate: +1 to use deduplicate. It's more consistent with the feature/concept which is already documented clearly in Flink.
>
> Regarding to `myTable.coalesce($("a"), 1).as("a")`: I'm still in favor of fillna for now. Compared to coalesce, fillna could handle multiple columns in one method call. For the naming convention, the name "fillna/dropna/replace" comes from Pandas [1][2][3].
>
> Regarding to `event-time/processing-time temporal join, SQL Hints, window TVF`: Good catch! Definitely we should support them in Table API. I will update the FLIP about these functionalities.
>
> [1] https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html
> [2] https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html
> [3] https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html

Re: [DISCUSS] FLIP-155: Introduce a few convenient operations in Table API

Timo Walther-2
Hi Dian,

Thanks for working on improving the Table API. I went through the entire
FLIP and many functions definitely make sense. However, we need to make
sure that the general API naming, behavior etc. remains consistent.

Here is some feedback from my side:

1) deduplicate
Are we planning to overload this method in Java or do users always have
to provide all 3 parameters? I'm asking because I find the `new
Expression[] {$("a"), $("b")}` not very fluent. It would be better to
have varargs at the end of the method signature instead.

If this is not possible, maybe we could also think about forcing
`withColumns()`/`withoutColumns()` programmatically at those locations
instead of using arrays. Maybe we can introduce an `ExpressionList` that
is returned by `withColumns()`/`withoutColumns()`.

Are users able to define `asc` or `desc` for the `orderField`?

2) topn

Rename to just `top` such that it reads `top(3)`?

Can't we use existing parts of the API for this task and introduce a
`partitionBy` clause: `Table.partitionBy(...).orderBy(...).limit(3)`

Actually, we could use a similar syntax for deduplicate as well:
`Table.partitionBy(...).orderBy(...).deduplicate()`
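
The partition / order / limit decomposition can be sketched with plain Java streams. This is only an illustration of the semantics on an in-memory list; `TopNSketch` and `topN` are hypothetical names, and nothing here is Flink API or says how a streaming implementation would maintain the result.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java illustration of partition / order / limit; not Flink API.
final class TopNSketch {

    static Map<String, List<Integer>> topN(List<Map.Entry<String, Integer>> rows, int n) {
        // "partitionBy": group values by key (TreeMap keeps keys in sorted order).
        Map<String, List<Integer>> partitions = rows.stream().collect(
                Collectors.groupingBy(
                        Map.Entry::getKey,
                        TreeMap::new,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
        // "orderBy" descending, then "limit(n)" within each partition.
        partitions.replaceAll((key, values) -> values.stream()
                .sorted(Comparator.reverseOrder())
                .limit(n)
                .collect(Collectors.toList()));
        return partitions;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> rows = List.of(
                Map.entry("a", 5), Map.entry("a", 9), Map.entry("a", 1),
                Map.entry("a", 7), Map.entry("b", 2));
        System.out.println(topN(rows, 3));
    }
}
```

Calling `topN(rows, 1)` keeps a single row per key, which is why the same partition-and-order shape can also describe deduplication.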

3) hint

How can we guarantee the same API for Scala and Java? A
`java.util.Map<String, String>` parameter would require Scala users to
perform collection conversions. Can we introduce a fluent way to unify
the two APIs?

For example, add a dedicated method for all kinds of hints?
```
   table
     .hintOption(String key, String value)
     .hintOption(String key, String value)
     .hintOption(String key, String value)
```
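
A minimal sketch of how such a fluent method could accumulate options, assuming an immutable handle that copies its hint map on every call. `HintedTableSketch` is a hypothetical stand-in for a table, and the keys and values used below are placeholders rather than real Flink options.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a table handle; not Flink API.
final class HintedTableSketch {
    private final Map<String, String> hints;

    HintedTableSketch() {
        this(new LinkedHashMap<>());
    }

    private HintedTableSketch(Map<String, String> hints) {
        this.hints = hints;
    }

    // Returns a new instance so calls can be chained without mutating the receiver.
    HintedTableSketch hintOption(String key, String value) {
        Map<String, String> copy = new LinkedHashMap<>(hints);
        copy.put(key, value);
        return new HintedTableSketch(copy);
    }

    // Read-only view of the accumulated options, in insertion order.
    Map<String, String> hints() {
        return Collections.unmodifiableMap(hints);
    }

    public static void main(String[] args) {
        HintedTableSketch table = new HintedTableSketch()
                .hintOption("key1", "value1")
                .hintOption("key2", "value2")
                .hintOption("key3", "value3");
        System.out.println(table.hints());
    }
}
```

Because callers only ever pass `String` arguments, no Java collection type appears in the method signature, so the call reads the same from Scala and Java.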

4) fillna

I don't find this name intuitive; it also doesn't match the other
methods of the API.

How about `replaceNull()`?

In general, I'm wondering here if we should rather introduce a
lambda-like function that would serve a variety of use cases:

Just an initial example:
```
table.mapColumns(e -> e.ifNull(1))
table.mapColumns(e -> e.ifNull(1), ExpressionList)
```
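
To show what the two overloads above might mean, here is a minimal plain-Java sketch on a single row represented as a map. `mapColumns` and `ifNull` mirror the names in the example, but the types are assumptions: a `Set<String>` of column names stands in for the `ExpressionList`, and plain values stand in for expressions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

// Plain-Java illustration; types and signatures are assumptions, not Flink API.
final class MapColumnsSketch {

    // Applies fn to every column of the row.
    static Map<String, Object> mapColumns(Map<String, Object> row, UnaryOperator<Object> fn) {
        return mapColumns(row, fn, row.keySet());
    }

    // Applies fn only to the listed columns; other columns are copied unchanged.
    static Map<String, Object> mapColumns(
            Map<String, Object> row, UnaryOperator<Object> fn, Set<String> columns) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : row.entrySet()) {
            Object value = e.getValue();
            out.put(e.getKey(), columns.contains(e.getKey()) ? fn.apply(value) : value);
        }
        return out;
    }

    // Value-level counterpart of e.ifNull(1): substitute a default for null.
    static UnaryOperator<Object> ifNull(Object fallback) {
        return value -> value != null ? value : fallback;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("a", null);
        row.put("b", 2);
        row.put("c", null);
        System.out.println(mapColumns(row, ifNull(1)));
        System.out.println(mapColumns(row, ifNull(1), Set.of("a")));
    }
}
```

The same `mapColumns` building block could carry other per-column functions, which is the generality argument being made for it over a dedicated null-filling method.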

5) dropna

Is this really useful? This sounds like a rarely used method.

6) replace

Similar to other proposed methods, we will have issues with the Scala
API when using a java.util.Map.

Furthermore, this map should also take expressions instead of objects.


Let me know what you think.

Regards,
Timo




On 06.01.21 05:00, Dian Fu wrote:

> Hi all,
>
> I have updated the FLIP about temporal join, sql hints and window TVF.
>
> Regards,
> Dian