Currently there is an ongoing survey about Python usage of Flink [1]. Some discussion was also brought up there regarding the non-JVM language support strategy in general. To avoid polluting the survey thread, we are starting this discussion thread and would like to move the discussion here.

In the interest of facilitating the discussion, we would like to first share the following design doc, which describes what we have done at Alibaba about the Python API for Flink. It could serve as a good reference for the discussion.

[DISCUSS] Flink Python API
<https://docs.google.com/document/d/1JNGWdLwbo_btq9RVrc1PjWJV3lYUgPvK0uEWDIfVNJI/edit?usp=drive_web>

As of now, we've implemented and delivered Python UDFs for SQL to the internal users at Alibaba. We are starting to implement the Python API.

To recap and continue the discussion from the survey thread, I agree with @Stephan that we should figure out in which general direction Python support should go. Stephan also listed three options there:
* Option (1): Language portability via Apache Beam
* Option (2): Implement own Python API
* Option (3): Implement own portability layer

From my perspective:

(1). Flink language APIs and Beam's language support are not mutually exclusive. It is nice that Beam has Python/NodeJS/Go APIs and supports Flink as the runner. Flink's own Python (or NodeJS/Go) APIs will benefit Flink's ecosystem.

(2). Python API / portability layer. To support non-JVM languages in Flink:
* at the client side, Flink would provide language interfaces, which translate the user's application to a Flink StreamGraph;
* at the server side, Flink would execute the user's UDF code at runtime.
The non-JVM languages communicate with the JVM via RPC (or a low-level socket, an embedded interpreter, and so on). What the portability layer can do, perhaps, is abstract the RPC layer. Even when the portability layer is ready, there is still a lot of work to do for a specific language. For Python, say, we may still have to write the interface classes by hand for the users, because generated code without detailed documentation is unacceptable for users, and we have to handle the serialization of lambdas/closures, which is not a built-in feature in Python. Maybe we can start with the Python API, then extend to other languages and abstract the common logic into the portability layer.
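As a concrete illustration of the lambda/closure serialization issue mentioned above: the standard pickle module refuses lambdas and closures, so a Python API would likely need a library such as cloudpickle (the approach PySpark takes). A minimal sketch, not part of the design doc:

```python
import pickle

import cloudpickle  # third-party; stdlib pickle cannot serialize lambdas


def make_udf():
    factor = 3
    return lambda x: x * factor  # a closure capturing `factor`


# Client side: serialize the function (bytecode plus captured variables)
# so it can be shipped to the Python worker process.
payload = cloudpickle.dumps(make_udf())

# Worker side: plain pickle can load what cloudpickle produced.
udf = pickle.loads(payload)
assert udf(2) == 6
```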
---
References:
[1] [SURVEY] Usage of flink-python and flink-streaming-python

Regards,
Xianda
RE: Stephan's options (
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/SURVEY-Usage-of-flink-python-and-flink-streaming-python-td25793.html
)
* Option (1): Language portability via Apache Beam
* Option (2): Implement own Python API
* Option (3): Implement own portability layer

Hi Stephan,

Eventually, I think we should support both option 1 and option 3. IMO, these two options are orthogonal. I agree with you that we can leverage the existing work and ecosystem in Beam by supporting option 1. But the problem with Beam is that it skips (to the best of my knowledge) the natural Table/SQL optimization framework provided by Flink. We should spend all the needed effort to support option 1 (as it is the better alternative to the current Flink Python API), but we cannot solely bet on it. Option 3 is the ideal choice for Flink to support all non-JVM languages, which we should plan to achieve. We have done some preliminary prototypes for options 2/3, and it seems not overly complex or difficult to accomplish.

Regards,
Shaoxuan

On Thu, Dec 13, 2018 at 4:58 PM Xianda Ke <[hidden email]> wrote:
> [...]
Hi Xianda, hi Shaoxuan,
I'd be in favor of option (1). There is great potential in Beam and Flink joining forces on this one. Here's why:

The Beam project spent at least a year developing a portability layer, with a reasonable number of people working on it. Developing a new portability layer from scratch will probably take about the same amount of time and resources.

Concerning option (2): There is already a Python API for Flink, but an API is only one part of the portability story. In Beam the portability is structured into three components:

- SDK (API, its Protobuf serialization, and interaction with the SDK Harness)
- Runner (translation from the Protobuf pipeline to a Flink job)
- SDK Harness (UDF execution, interaction with the SDK and the execution engine)

I could imagine the Flink Python API would be another SDK which could have its own API but would reuse code for the interaction with the SDK Harness.

We would be able to focus on the optimizations instead of rebuilding a portability layer from scratch.
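For a sense of how these pieces fit together from the user's side, here is a minimal sketch of submitting a Beam Python pipeline to the Flink runner through the portability framework. The flag names and job-server endpoint below are assumptions for illustration (8099 is the Flink job server's usual default), not settings this thread prescribes:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# The SDK serializes the pipeline to Protobuf and submits it to a job
# server, which translates it into a Flink job (the "Runner" component).
options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",
    "--environment_type=LOOPBACK",  # run the SDK harness in-process
])

with beam.Pipeline(options=options) as p:
    (p
     | beam.Create(["flink", "beam"])
     | beam.Map(str.upper)  # a Python UDF, executed by the SDK harness
     | beam.Map(print))
```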
Thanks,
Max

On 13.12.18 11:52, Shaoxuan Wang wrote:
> [...]
Hi Shaoxuan,
FWIW, Kenn Knowles (Beam PMC Chair) recently gave a talk at the Bay Area Apache Beam Meetup [1] which included a bit on a vision for how Beam could better leverage runner-specific optimizations -- as an example/extension, Beam SQL leveraging Flink-specific SQL optimizations (to address your point). So that is part of the eventual roadmap for Beam, and it illustrates how concrete efforts towards optimizations in the Runner/SDK-Harness would likely yield the desired result of cross-language support (which could be done by leveraging Beam, while devoting focus to optimizing that processing on Flink).

Cheers,
Austin

[1] https://www.meetup.com/San-Francisco-Apache-Beam/events/256348972/ -- I can post/share videos once available, should someone desire.

On Fri, Dec 14, 2018 at 6:03 AM Maximilian Michels <[hidden email]> wrote:
> [...]
Interest in Python seems on the rise, so this is a good discussion to have :)

So far there seems to be agreement that Beam's approach towards Python and other non-JVM language support (language SDKs, portability layer, etc.) is the right direction? Specification and execution are native Python, and it does not suffer from the shortcomings of Flink's Jython API and a few other approaches.

Overall there already is good alignment between Beam and Flink in concepts and model. There are also a few of us who are active in both communities. The Beam Flink runner has made a lot of progress this year, but work on portability in Beam actually started much before that and was a very big change (originally there was just the Java SDK). Much of the code has been rewritten as part of the effort; that's what implementing a strong multi-language support story took. To have a decent shot at it, the equivalent of much of the Beam portability framework would need to be reinvented in Flink. This would fork resources and divert focus away from things that may be more core to Flink. As you can guess, I am in favor of option (1)!

We could take a look at SQL for reference. The Flink community has invested a lot in SQL, and there remains a lot of work to do. The Beam community has done the same, and we have two completely separate implementations. When I recently learned more about the Beam SQL work, one of my first questions was whether a joint effort would not lead to better user value. Calcite is common, but isn't there much more that could be shared? Someone had the idea that in such a world Flink could substitute portions or all of the graph provided by Beam with its own optimized version, while much of the tooling stayed the same.

IO connectors are another area where much effort is repeated. It takes a very long time to arrive at a solid, production-quality implementation (typically resulting from broad user exposure and running at scale). Currently there is discussion of how connectors can be done much better in both projects: SDF in Beam [1] and FLIP-27.

It's a trade-off, but more synergy would be great!

Thomas

[1]
https://docs.google.com/document/d/1cKOB9ToasfYs1kLWQgffzvIbJx2Smy4svlodPRhFrk4/

On Tue, Dec 18, 2018 at 2:16 PM Austin Bennett <[hidden email]> wrote:
> [...]
Hey guys,

Thanks for your comments, and sorry for the late reply.

The Beam Python API and the Flink Python Table API describe the DAG/pipeline in different manners. We got a chance to communicate with Tyler Akidau (from Beam) offline and explained why the Flink Table API needs a specific design for Python, rather than purely leveraging the Beam portability layer.

In our proposal, most of the Python code is just a DAG/pipeline builder for the Table API. The majority of operators run purely in Java, while only Python UDFs are executed in a Python environment at runtime. This design does not affect the development and adoption of the Beam language portability layer with the Flink runner. The Flink and Beam communities will still collaborate, and jointly develop and optimize the JVM / non-JVM (Python, Go) bridge (data and control connections between different processes) to ensure robustness and performance.
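To make the "pipeline builder" split concrete, here is a purely hypothetical sketch; every name in it (TableEnvironment, udf, scan, and so on) is illustrative only and not an existing API:

```python
# Hypothetical sketch only -- none of these names exist yet.
t_env = TableEnvironment.get_instance()

# The only code that would run in a Python process at runtime: a UDF.
@udf(result_type="BIGINT")
def str_len(s):
    return len(s)

t_env.register_function("str_len", str_len)

# These calls merely build the logical plan on the client; the resulting
# operators (scan, select, sink) execute entirely in the JVM.
t_env.scan("orders") \
     .select("user, str_len(product) as product_len") \
     .insert_into("result_sink")

t_env.execute()
```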
Regards,
Shaoxuan

On Fri, Dec 21, 2018 at 1:39 PM Thomas Weise <[hidden email]> wrote:
> [...]
Hi everyone,

Sorry to join this discussion late.

Thanks to Xianda Ke for initiating this discussion. I also enjoyed the discussions and suggestions by Max, Austin, Thomas, Shaoxuan, and others.

Recently, I have really felt the community's and Flink users' desire for Python support. Stephan also pointed out in the discussion of `Adding a mid-term roadmap`: "Table API becomes primary API for analytics use cases". A large number of users in analytics use cases are accustomed to the Python language, and a large accumulation of class libraries lives in the Python ecosystem.

So I am very interested in participating in the discussion of supporting Python in Flink. With regard to the three options mentioned so far, it is very encouraging to leverage Beam's language portability layer on Flink. For now, we can take a first step in Flink by adding a Python Table API. I believe that in this process we will gain a deeper understanding of how Flink should support Python. If we can quickly let users experience a first version of the Flink Python Table API, we can also receive feedback from many users and then consider the long-term goals of multi-language support on Flink.

So, if you agree, I volunteer to draft a document with the detailed design and implementation plan of the Python Table API on Flink.

What do you think?

On Thu, Feb 21, 2019 at 10:44 PM, Shaoxuan Wang <[hidden email]> wrote:
> [...]
Hi jincheng,

Thanks for reactivating this discussion. I personally look forward to your design draft.

Best,
Vino

On Thu, Mar 28, 2019 at 12:16 PM, jincheng sun <[hidden email]> wrote:
> [...]
Much of the code has > > been > > > rewritten as part of the effort; that's what implementing a strong > multi > > > language support story took. To have a decent shot at it, the > equivalent > > of > > > much of the Beam portability framework would need to be reinvented in > > > Flink. This would fork resources and divert focus away from things that > > may > > > be more core to Flink. As you can guess I am in favor of option (1) ! > > > > > > We could take a look at SQL for reference. Flink community has > invested a > > > lot in SQL and there remains a lot of work to do. Beam community has > done > > > the same and we have two completely separate implementations. When I > > > recently learned more about the Beam SQL work, one of my first > questions > > > was if joined effort would not lead to better user value? Calcite is > > > common, but isn't there much more that could be shared? Someone had the > > > idea that in such a world Flink could just substitute portions or all > of > > > the graph provided by Beam with it's own optimized version but much of > > the > > > tooling could be same? > > > > > > IO connectors are another area where much effort is repeated. It takes > a > > > very long time to arrive at a solid, production quality implementation > > > (typically resulting from broad user exposure and running at scale). > > > Currently there is discussion how connectors can be done much better in > > > both projects: SDF in Beam [1] and FLIP-27. > > > > > > It's a trade-off, but more synergy would be great! > > > > > > Thomas > > > > > > [1] > > > > > > > > > https://docs.google.com/document/d/1cKOB9ToasfYs1kLWQgffzvIbJx2Smy4svlodPRhFrk4/ > > > > > > > > > On Tue, Dec 18, 2018 at 2:16 PM Austin Bennett < > > > [hidden email]> > > > wrote: > > > > > > > Hi Shaoxuan, > > > > > > > > FWIW, Kenn Knowles (Beam PMC Chair) recently gave a talk at the Bay > > Area > > > > Apache Beam Meetup[1] which included a bit on a vision for how Beam > > could > > > > better leverage runner specific optimizations -- as an > > example/extension, > > > > Beam SQL leveraging Flink specific SQL optimizations (to address your > > > > point). So, that is part of the eventual roadmap for Beam, and > > > illustrates > > > > how concrete efforts towards optimizations in Runner/SDK-Harness > would > > > > likely yield the desired result of cross-language support (which > could > > be > > > > done by leveraging Beam, and devote focus to optimizing that > processing > > > on > > > > Flink). > > > > > > > > Cheers, > > > > Austin > > > > > > > > > > > > [1] > https://www.meetup.com/San-Francisco-Apache-Beam/events/256348972/ > > > -- > > > > I > > > > can post/share videos once available should someone desire. > > > > > > > > On Fri, Dec 14, 2018 at 6:03 AM Maximilian Michels <[hidden email]> > > > wrote: > > > > > > > > > Hi Xianda, hi Shaoxuan, > > > > > > > > > > I'd be in favor of option (1). There is great potential in Beam and > > > Flink > > > > > joining forces on this one. Here's why: > > > > > > > > > > The Beam project spent at least a year developing a portability > layer > > > > with > > > > > a > > > > > reasonable amount of people working on it. Developing a new > > portability > > > > > layer > > > > > from scratch will probably take about the same amount of time and > > > > > resources. > > > > > > > > > > Concerning option (2): There is already a Python API for Flink but > an > > > API > > > > > is > > > > > only one part of the portability story. 
In Beam the portability is > > > > > structured > > > > > into three components: > > > > > > > > > > - SDK (API, its Protobuf serialization, and interaction with the > SDK > > > > > Harness) > > > > > - Runner (Translation from Protobuf pipeline to Flink job) > > > > > - SDK Harness (UDF execution, Interaction with the SDK and the > > > execution > > > > > engine) > > > > > > > > > > I could imagine the Flink Python API would be another SDK which > could > > > > have > > > > > its > > > > > own API but would reuse code for the interaction with the SDK > > Harness. > > > > > > > > > > We would be able to focus on the optimizations instead of > rebuilding > > a > > > > > portability layer from scratch. > > > > > > > > > > Thanks, > > > > > Max > > > > > > > > > > On 13.12.18 11:52, Shaoxuan Wang wrote: > > > > > > RE: Stephen's options ( > > > > > > > > > > > > > > > > > > > > > http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/SURVEY-Usage-of-flink-python-and-flink-streaming-python-td25793.html > > > > > > ) > > > > > > * Option (1): Language portability via Apache Beam > > > > > > * Option (2): Implement own Python API > > > > > > * Option (3): Implement own portability layer > > > > > > > > > > > > Hi Stephen, > > > > > > Eventually, I think we should support both option1 and option3. > > TMO, > > > > > these > > > > > > two options are orthogonal. I agree with you that we can leverage > > the > > > > > > existing work and ecosystem in beam by supporting option1. But > the > > > > > problem > > > > > > of beam is that it skips (to the best of my knowledge) the > natural > > > > > > table/SQL optimization framework provided by Flink. We should > spend > > > all > > > > > the > > > > > > needed efforts to support solution1 (as it is the better > > alternative > > > of > > > > > the > > > > > > current Flink python API), but cannot solely bet on it. Option3 > is > > > the > > > > > > ideal choice for Flink to support all Non-JVM languages which we > > > should > > > > > > better plan to achieve. We have done some preliminary prototypes > > for > > > > > > option2/option3, and it seems not quite complex and difficult to > > > > > accomplish. > > > > > > > > > > > > Regards, > > > > > > Shaoxuan > > > > > > > > > > > > > > > > > > On Thu, Dec 13, 2018 at 4:58 PM Xianda Ke <[hidden email]> > > > wrote: > > > > > > > > > > > >> Currently there is an ongoing survey about Python usage of Flink > > > [1]. > > > > > Some > > > > > >> discussion was also brought up there regarding non-jvm language > > > > support > > > > > >> strategy in general. To avoid polluting the survey thread, we > are > > > > > starting > > > > > >> this discussion thread and would like to move the discussions > > here. > > > > > >> > > > > > >> In the interest of facilitating the discussion, we would like to > > > first > > > > > >> share the following design doc which describes what we have done > > at > > > > > Alibaba > > > > > >> about Python API for Flink. It could serve as a good reference > to > > > the > > > > > >> discussion. > > > > > >> > > > > > >> [DISCUSS] Flink Python API > > > > > >> < > > > > > >> > > > > > > > > > > > > > > > https://docs.google.com/document/d/1JNGWdLwbo_btq9RVrc1PjWJV3lYUgPvK0uEWDIfVNJI/edit?usp=drive_web > > > > > >>> > > > > > >> > > > > > >> As of now, we've implemented and delivered Python UDF for SQL > for > > > the > > > > > >> internal users at Alibaba. > > > > > >> We are starting to implement Python API. 
|
Hi Shaoxuan & Jincheng,
Thanks for driving this initiative. Python would be a very big add-on for
Flink adoption in the data science world. One additional suggestion: you may
need to think about how to convert a Flink Table to a pandas DataFrame, pandas
being a very popular library in Python. You may also be interested in Apache
Arrow, which is a common layer for transferring data efficiently across
languages: https://arrow.apache.org/

vino yang <[hidden email]> wrote on Thu, Mar 28, 2019 at 2:44 PM:

> Hi jincheng,
>
> Thanks for activating this discussion again.
> I personally look forward to your design draft.
>
> Best,
> Vino
>
> jincheng sun <[hidden email]> wrote on Thu, Mar 28, 2019 at 12:16 PM:
>
> > Hi everyone,
> > Sorry to join this discussion late.
> >
> > Thanks to Xianda Ke for initiating this discussion. I also enjoyed the
> > discussions and suggestions by Max, Austin, Thomas, Shaoxuan and others.
> >
> > Recently, I have felt the desire of the community and of Flink users for
> > Python support. Stephan also pointed out in the discussion of `Adding a
> > mid-term roadmap`: "Table API becomes primary API for analytics use
> > cases", while a large number of users in analytics use cases are
> > accustomed to the Python language, and a large accumulation of class
> > libraries lives in the Python ecosystem.
> >
> > So I am very interested in participating in the discussion of supporting
> > Python in Flink. With regard to the three options mentioned so far,
> > leveraging Beam's language-portability layer on Flink is very
> > encouraging. For now, we can take a first step in Flink by adding a
> > Python Table API. I believe that in this process we will gain a deeper
> > understanding of how Flink should support Python. If we can quickly let
> > users experience a first version of the Flink Python Table API, we can
> > also receive feedback from many users and consider the long-term goals
> > of multi-language support on Flink.
> >
> > So if you agree, I volunteer to draft a document with the detailed
> > design and implementation plan of the Python Table API on Flink.
> >
> > What do you think?
> >
> > Shaoxuan Wang <[hidden email]> wrote on Thu, Feb 21, 2019 at 10:44 PM:
> >
> > > Hey guys,
> > > Thanks for your comments and sorry for the late reply.
> > > The Beam Python API and the Flink Python Table API describe the
> > > DAG/pipeline in different manners. We got a chance to communicate with
> > > Tyler Akidau (from Beam) offline, and explained why the Flink Table
> > > API needs a specific design for Python, rather than purely leveraging
> > > the Beam portability layer.
> > >
> > > In our proposal, most of the Python code is just a DAG/pipeline
> > > builder for the Table API. The majority of operators run purely in
> > > Java, while only Python UDFs are executed in a Python environment at
> > > runtime. This design does not affect the development and adoption of
> > > the Beam language-portability layer with the Flink runner. The Flink
> > > and Beam communities will still collaborate, and jointly develop and
> > > optimize the JVM / non-JVM (Python, Go) bridge (the data and control
> > > connections between different processes) to ensure robustness and
> > > performance.
> > >
> > > Regards,
> > > Shaoxuan

--
Best Regards

Jeff Zhang
|
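Following up on Jeff's pandas/Arrow suggestion, here is a minimal sketch of
what the interchange could look like. Only the pyarrow and pandas calls are
real APIs; how the Arrow bytes would be produced or consumed on the Flink JVM
side is left entirely as an assumption.

import pandas as pd
import pyarrow as pa

def arrow_bytes_to_pandas(ipc_bytes: bytes) -> pd.DataFrame:
    # Deserialize an Arrow IPC stream (e.g. one a JVM-side table sink could
    # emit) into a pandas DataFrame.
    reader = pa.ipc.open_stream(ipc_bytes)
    return reader.read_all().to_pandas()

def pandas_to_arrow_bytes(df: pd.DataFrame) -> bytes:
    # Serialize a pandas DataFrame into an Arrow IPC stream that a JVM-side
    # reader could consume without per-row (de)serialization.
    table = pa.Table.from_pandas(df)
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    return sink.getvalue().to_pybytes()

# Round-trip check, purely on the Python side:
df = pd.DataFrame({"word": ["flink", "beam"], "count": [2, 1]})
assert arrow_bytes_to_pandas(pandas_to_arrow_bytes(df)).equals(df)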
Thanks for your feedback Vino, Jeff!
I have started a new thread outlining what we are proposing for the Python
Table API. [DISCUSS] FLIP-38 Support python language in flink TableAPI can be
found here:
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-38-Support-python-language-in-flink-TableAPI-td28061.html

Best,
Jincheng
|
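FLIP-38's concrete interfaces were still being drafted at this point in the
thread, so the following is only a toy sketch of the "Python as a thin
DAG/pipeline builder" design Shaoxuan describes above: each Python call merely
records a relational operation to be translated into Java operators, and only
functions marked as UDFs would run in a Python process. Every name below is
hypothetical, not the eventual API.

from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Table:
    # The logical plan that would be shipped to the JVM for translation
    # and optimization; no data flows through the Python process here.
    plan: List[str] = field(default_factory=list)

    def group_by(self, keys: str) -> "Table":
        return Table(self.plan + [f"group_by({keys})"])

    def select(self, expr: str) -> "Table":
        return Table(self.plan + [f"select({expr})"])

def udf(fn: Callable) -> Callable:
    # Mark a function for execution in the Python harness at runtime.
    fn.is_python_udf = True
    return fn

@udf
def normalize(s: str) -> str:
    return s.strip().lower()

t = Table(["scan(words)"]).group_by("word").select("word, count(1) as cnt")
print(t.plan)  # ['scan(words)', 'group_by(word)', 'select(word, count(1) as cnt)']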