(DEPRECATED) Apache Flink Mailing List archive.

[jira] [Created] (FLINK-11976) PyFlink vectorized python udf with Pandas support

Classic

List

Threaded

1 message

Shang Yuanchun (Jira)

[jira] [Created] (FLINK-11976) PyFlink vectorized python udf with Pandas support

Yurui Zhou created FLINK-11976:
----------------------------------

Summary: PyFlink vectorized python udf with Pandas support
Key: FLINK-11976
URL: https://issues.apache.org/jira/browse/FLINK-11976
Project: Flink
Issue Type: New Feature
Components: API / Python
Reporter: Yurui Zhou

h2. Motivation

Currently, the PyFlink allow user to compose Flink data transformation and define UDF in python. The PyFlink transform python scripts into operation plans and send it over to Java runtime, having the Java runtime execute the operations accordingly and return the executed result.

While encountering Python UDF, the Java runtime create another Python worker, serialized the data and have it send over to python worker. The python worker processed data in row based manner and send it back to Java runtime.

How flink python UDF works:

!https://intranetproxy.alipay.com/skylark/lark/0/2019/png/93219/1551259580271-ebffa0d7-f675-43bf-aa6d-4e94a54b1f10.png!

There are several limitation with current python udf:
* Inefficient data movement between Java and Python (Serialization/Deserialization)

* Scalar Computation model

h2. Goals
* Enable Pandas support in Flink Python UDF.
* Enable vectorizied Python UDF execution based on Pandas
* Using Apache Arrow as the serialization format between Java runtime and Python worker

Pandas UDF (vectorized UDF)
h3. Benefits
* Provided high performance, easy-to-use data structures and data analysis tools for Python.
* Pandas already provide interface to directly interact with Apache Arrow
* Enable vectorized computation to fully taking advantage of the Arrow Memory layout.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)