Yurui Zhou created FLINK-11976:
----------------------------------
Summary: PyFlink vectorized python udf with Pandas support
Key: FLINK-11976
URL: https://issues.apache.org/jira/browse/FLINK-11976
Project: Flink
Issue Type: New Feature
Components: API / Python
Reporter: Yurui Zhou
h2. Motivation
Currently, PyFlink allows users to compose Flink data transformations and define UDFs in Python. PyFlink translates the Python script into an operation plan and sends it to the Java runtime, which executes the operations and returns the result.
When the Java runtime encounters a Python UDF, it creates a separate Python worker, serializes the data, and sends it to that worker. The Python worker processes the data row by row and sends the results back to the Java runtime.
How a Flink Python UDF works:
!https://intranetproxy.alipay.com/skylark/lark/0/2019/png/93219/1551259580271-ebffa0d7-f675-43bf-aa6d-4e94a54b1f10.png!
The current Python UDF has several limitations:
* Inefficient data movement between Java and Python (serialization/deserialization)
* Scalar computation model (the UDF is invoked once per row)
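To illustrate the second limitation, here is a minimal sketch that contrasts a row-at-a-time function with a vectorized one operating on a whole pandas column. The function names `add_one_scalar` and `add_one_vectorized` are hypothetical and are not part of any PyFlink API; the sketch only shows the difference between the two computation models.

```python
import pandas as pd


def add_one_scalar(x):
    # Row-based model: called once for every input value.
    return x + 1


def add_one_vectorized(col: pd.Series) -> pd.Series:
    # Vectorized model: called once per batch; pandas applies the
    # arithmetic to the whole column at once.
    return col + 1


values = pd.Series(range(5), dtype="int64")

# One Python-level call per row.
row_based = pd.Series([add_one_scalar(v) for v in values], dtype="int64")

# A single call for the whole batch.
vectorized = add_one_vectorized(values)
```

Both variants produce the same values; the vectorized form replaces the per-row Python calls with a single call per batch.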
h2. Goals
* Enable Pandas support in Flink Python UDF.
* Enable vectorized Python UDF execution based on Pandas.
* Use Apache Arrow as the serialization format between the Java runtime and the Python worker.
h2. Pandas UDF (vectorized UDF)
h3. Benefits
* Pandas provides high-performance, easy-to-use data structures and data analysis tools for Python.
* Pandas already provides an interface for interacting directly with Apache Arrow.
* Vectorized computation can take full advantage of the Arrow memory layout.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)