(DEPRECATED) Apache Flink Mailing List archive.

[DISCUSS] What parts of the Python API should we focus on next ?

jincheng sun

[DISCUSS] What parts of the Python API should we focus on next ?

Hi folks,

As release-1.10 is under feature freeze (stateless Python UDFs are already
supported), it is time for us to plan the features of PyFlink for the next
release.

To make sure the features supported in PyFlink are the ones most in demand
in the community, we'd like to get more people involved; that is, it would
be better if all devs and users joined the discussion of which features are
more important and urgent.

We have already listed some features from different aspects, which you can
find below; however, this is not the final plan. We appreciate any
suggestions from the community, on either functionality or performance
improvements. It would be great to have the following information if you
want to suggest a feature:

---------
- Feature description: xxxx
- Benefits of the feature: xxxx
- Use cases (optional): xxxx
----------

----Features in my mind----

1. Integration with the most popular Python libraries
- fromPandas/toPandas API
Description:
Support conversion between Table and pandas.DataFrame.
Benefits:
Users could switch between the Flink and Pandas APIs, for example, do
some analysis using Flink and then continue the analysis using the Pandas
API if the result data is small enough to fit into memory, and vice versa.

- Support Scalar Pandas UDF
Description:
Support scalar Pandas UDFs in the Python Table API & SQL. Both the
input and output of the UDF are pandas.Series.
Benefits:
1) Scalar Pandas UDFs perform better than row-at-a-time UDFs, with
speedups ranging from 3x to over 100x (figures from PySpark)
2) Users could use the Pandas/NumPy APIs in the Python UDF
implementation if the input/output data type is pandas.Series

- Support Pandas UDAF in batch GroupBy aggregation
Description:
Support Pandas UDAFs in batch GroupBy aggregation in the Python Table
API & SQL. Both the input and output of the UDF are pandas.DataFrame.
Benefits:
1) Pandas UDAFs can outperform row-at-a-time UDAFs by more than 10x
in certain scenarios
2) Users could use the Pandas/NumPy APIs in the Python UDAF
implementation if the input/output data type is pandas.DataFrame
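
To illustrate the difference in calling convention between a row-at-a-time
UDF and a batch (vectorized) UDF, here is a minimal sketch in plain Python.
It is not the PyFlink API: all function names are hypothetical, and plain
lists stand in for pandas.Series. It only shows that the batch form is
invoked far fewer times for the same input.

```python
from typing import List, Tuple

# Hypothetical row-at-a-time UDF: invoked once per input row.
def add_one_row(x: int) -> int:
    return x + 1

# Hypothetical vectorized UDF: invoked once per batch of rows.
# A real Pandas UDF would take and return pandas.Series instead of lists.
def add_one_batch(xs: List[int]) -> List[int]:
    return [x + 1 for x in xs]

def run_row_at_a_time(udf, rows: List[int]) -> Tuple[List[int], int]:
    """Call the UDF once per row and count the invocations."""
    out, calls = [], 0
    for row in rows:
        out.append(udf(row))
        calls += 1
    return out, calls

def run_batched(udf, rows: List[int], batch_size: int) -> Tuple[List[int], int]:
    """Call the UDF once per batch of at most batch_size rows."""
    out, calls = [], 0
    for start in range(0, len(rows), batch_size):
        out.extend(udf(rows[start:start + batch_size]))
        calls += 1
    return out, calls

rows = list(range(10))
row_result, row_calls = run_row_at_a_time(add_one_row, rows)
batch_result, batch_calls = run_batched(add_one_batch, rows, batch_size=4)
```

Both runs produce the same result, but the batched run makes 3 calls
instead of 10. The sketch does not model serialization or NumPy
vectorization, which presumably account for much of the speedup quoted
above.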

2. Fully support all kinds of Python UDFs
- Support Python UDAF (stateful) in GroupBy aggregation (NOTE: Please
give us some use cases if you want this feature to be included in the next
release)
Description:
Support Python UDAFs in GroupBy aggregation.
Benefits:
Users could define Python UDAFs and use them in GroupBy
aggregation. Without this, users have to use Java/Scala UDAFs.

- Support Python UDTF
Description:
Support Python UDTFs in the Python Table API & SQL
Benefits:
Users could define and use Python UDTFs in the Python Table API & SQL.
Without this, users have to use Java/Scala UDTFs.
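
As a rough illustration of what these two kinds of functions look like,
here is a minimal sketch in plain Python. It is not a proposed PyFlink
interface: the class, the method names (create_accumulator, accumulate,
get_value) and split_words are hypothetical. The point is that a UDAF
carries state (an accumulator) across the rows of a group, while a UDTF
emits zero or more rows per input row.

```python
from typing import Iterator, List, Tuple

class WeightedAvg:
    """Hypothetical UDAF: the accumulator holds running state for one group."""

    def create_accumulator(self) -> List[int]:
        # [weighted sum, total weight]
        return [0, 0]

    def accumulate(self, acc: List[int], value: int, weight: int) -> None:
        # Fold one input row into the accumulator.
        acc[0] += value * weight
        acc[1] += weight

    def get_value(self, acc: List[int]) -> float:
        # Final result for the group; 0.0 if no weight was accumulated.
        return acc[0] / acc[1] if acc[1] else 0.0

def split_words(line: str) -> Iterator[Tuple[str, int]]:
    """Hypothetical UDTF: yields one (word, length) row per word in the line."""
    for word in line.split():
        yield word, len(word)

# Aggregate the rows of a single group.
agg = WeightedAvg()
acc = agg.create_accumulator()
for value, weight in [(10, 1), (20, 3)]:
    agg.accumulate(acc, value, weight)
avg = agg.get_value(acc)  # (10*1 + 20*3) / (1 + 3)

# Expand one input row into several output rows.
rows = list(split_words("hello py flink"))
```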

3. Debugging and Monitoring of Python UDFs
- Support User-Defined Metrics
Description:
Allow users to define metrics and access global job parameters
in Python UDFs.
Benefits:
Metrics let users monitor business or technical indicators from
inside a UDF, which is a common requirement for UDFs.

- Make the log level configurable
Description:
Allow users to configure the log level of Python UDFs.
Benefits:
Users could configure different log levels for debugging and for
deployment.
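
A minimal sketch of both ideas, using only the Python standard library.
The Counter class, the logger name and make_parse_udf are hypothetical
and are not PyFlink APIs; the sketch only shows a UDF that bumps a
user-defined counter and logs at a level that can be changed without
touching the UDF code.

```python
import logging
from typing import Callable, Optional

class Counter:
    """Hypothetical user-defined metric: a simple monotonically increasing counter."""

    def __init__(self) -> None:
        self.count = 0

    def inc(self, n: int = 1) -> None:
        self.count += n

def make_parse_udf(logger: logging.Logger,
                   errors: Counter) -> Callable[[str], Optional[int]]:
    """Build a UDF that parses integers, counting and logging bad inputs."""
    def parse_int(s: str) -> Optional[int]:
        try:
            return int(s)
        except ValueError:
            errors.inc()
            logger.debug("could not parse %r", s)
            return None
    return parse_int

# The level could come from job configuration; here it is a literal string.
logger = logging.getLogger("python_udf_example")
logger.setLevel("WARNING")  # deployment: debug messages are suppressed

bad_input = Counter()
parse_int = make_parse_udf(logger, bad_input)
results = [parse_int(s) for s in ["1", "x", "3", ""]]
```

Switching to logger.setLevel("DEBUG") enables the per-row debug messages
while debugging, and bad_input.count can be reported as a metric.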

4. Enrich the Python execution environment
- Docker Mode Support
Description:
Support running Python UDFs in Docker workers.
Benefits:
Support a variety of deployment modes to meet more users' requirements.

5. Expand the usage scope of Python UDFs
- Support using Python UDFs via the SQL client
Description:
Support registering and using Python UDFs via the SQL client
Benefits:
The SQL client is a very important interface for SQL users. This
feature allows SQL users to use Python UDFs via the SQL client.

- Integrate Python UDFs with notebooks
Description:
Such as Zeppelin, etc. (especially Python dependencies)

- Support registering Python UDFs in the catalog
Description:
Support registering Python UDFs in the catalog
Benefits:
1) The catalog is the centralized place to manage metadata such as
tables, UDFs, etc. With it, users could register a UDF once and use it
anywhere.
2) It's an important part of the SQL functionality. If Python
UDFs cannot be registered in and used from the catalog, they
cannot be shared between jobs.

6. Performance Improvements of Python UDFs
- Cython improvements
Description:
Cython improvements in coders & operations
Benefits:
Initial tests show that Cython gives a 3x+ speedup in coder
serialization/deserialization.

7. Add Python ML API
- Add Python ML Pipeline API
Description:
Align the Python ML Pipeline API with Java/Scala
Benefits:
1) We already have the Pipeline APIs for ML. It would be
good to also have the corresponding Python APIs.
2) In many cases, algorithm engineers prefer the Python language.

BTW, PyFlink is a new component, and there is still a lot of work to
do. Thus, everybody is cordially welcome to contribute to PyFlink,
including asking questions, filing bug reports, proposing new
features, joining discussions, contributing code or documentation ...

Hope to see your feedback!

Best,
Jincheng

jincheng sun

Re: [DISCUSS] What parts of the Python API should we focus on next ?

Also CC user-zh.

Best,
Jincheng

jincheng sun <[hidden email]> wrote on Thu, Dec 19, 2019 at 10:20 AM:

bowen.li

Re: [DISCUSS] What parts of the Python API should we focus on next ?

- Integrate PyFlink with Jupyter notebook
- Description: users should be able to run PyFlink seamlessly in Jupyter
- Benefits: Jupyter is the industry-standard notebook for data
scientists. I've talked to a few companies in North America; they think
Jupyter is the #1 way to empower internal data scientists with Flink

On Wed, Dec 18, 2019 at 19:05 jincheng sun <[hidden email]> wrote:

> Also CC user-zh.
>
> Best,
> Jincheng

jincheng sun

Re: [DISCUSS] What parts of the Python API should we focus on next ?

Hi Bowen,

Your suggestions are very helpful for expanding the PyFlink ecosystem. I
also mentioned notebook integration above; Jupyter and Zeppelin are both
excellent notebooks. Integrating with Jupyter and Zeppelin also requires
support from people in the Jupyter and Zeppelin communities.
Jeff has already made great efforts in the Zeppelin community for PyFlink. I
would greatly appreciate it if anyone who is active in the Jupyter community
were also willing to help integrate PyFlink.

Best,
Jincheng

Bowen Li <[hidden email]> wrote on Fri, Dec 20, 2019 at 12:55 AM:

> - Integrate PyFlink with Jupyter notebook
> - Description: users should be able to run PyFlink seamlessly in Jupyter
> - Benefits: Jupyter is the industry-standard notebook for data
> scientists. I've talked to a few companies in North America; they think
> Jupyter is the #1 way to empower internal data scientists with Flink