[jira] [Created] (FLINK-11976) PyFlink vectorized python udf with Pandas support

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

[jira] [Created] (FLINK-11976) PyFlink vectorized python udf with Pandas support

Shang Yuanchun (Jira)
Yurui Zhou created FLINK-11976:
----------------------------------

             Summary: PyFlink vectorized python udf with Pandas support
                 Key: FLINK-11976
                 URL: https://issues.apache.org/jira/browse/FLINK-11976
             Project: Flink
          Issue Type: New Feature
          Components: API / Python
            Reporter: Yurui Zhou


h2. Motivation

Currently, the PyFlink  allow user to compose Flink data transformation and define UDF in python.  The PyFlink  transform python scripts into operation plans and send it over to Java runtime, having the Java runtime execute the operations accordingly and return the executed result. 

While encountering Python UDF, the Java runtime create another Python worker, serialized the data and have it send over to python worker. The python worker processed data in row based manner and send it back to Java runtime. 

 

How flink python UDF works:

 
!https://intranetproxy.alipay.com/skylark/lark/0/2019/png/93219/1551259580271-ebffa0d7-f675-43bf-aa6d-4e94a54b1f10.png!
 

There are several limitation with current python udf:
 * Inefficient data movement between Java and Python (Serialization/Deserialization)

 * Scalar Computation model

 
h2. Goals
 * Enable Pandas support in Flink Python UDF.
 * Enable vectorizied Python UDF execution based on Pandas
 * Using Apache Arrow as the serialization format between Java runtime and Python worker

 

Pandas UDF (vectorized UDF)
h3. Benefits
 * Provided high performance, easy-to-use data structures and data analysis tools for Python.
 * Pandas already provide interface to directly interact with Apache Arrow
 * Enable vectorized computation to fully taking advantage of the Arrow Memory layout. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)