Hi everyone,
I'd like to start a discussion about supporting conversion between PyFlink Table and Pandas DataFrame. Pandas dataframe is the de-facto standard to work with tabular data in Python community. PyFlink table is Flink’s representation of the tabular data in Python language. It would be nice to provide the functionality to convert between the PyFlink table and Pandas dataframe in PyFlink Table API. It provides users the ability to switch between PyFlink and Pandas seamlessly when processing data in Python language without an extra intermediate connectors. Jincheng Sun and I have discussed offline and have drafted the FLIP-120[1]. Looking forward to your feedback! Regards, Dian [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-120%3A+Support+conversion+between+PyFlink+Table+and+Pandas+DataFrame |
Thanks Dian for driving this, definitely +1
Here's my 2 cents: 1. I would pay more attention on to_pandas than from_pandas. Because to_pandas will be used more frequently I believe 2. I think ArrowTableSink may not be enough for to_pandas, because pandas dataframe is on client side, it is not a table sink. We still need to convert ArrowTableSink to pandas dataframe if I understand correctly. Dian Fu <[hidden email]> 于2020年4月1日周三 上午10:49写道: > Hi everyone, > > I'd like to start a discussion about supporting conversion between PyFlink > Table and Pandas DataFrame. > > Pandas dataframe is the de-facto standard to work with tabular data in > Python community. PyFlink table is Flink’s representation of the tabular > data in Python language. It would be nice to provide the functionality to > convert between the PyFlink table and Pandas dataframe in PyFlink Table > API. It provides users the ability to switch between PyFlink and Pandas > seamlessly when processing data in Python language without an extra > intermediate connectors. > > Jincheng Sun and I have discussed offline and have drafted the > FLIP-120[1]. Looking forward to your feedback! > > Regards, > Dian > > [1] > https://cwiki.apache.org/confluence/display/FLINK/FLIP-120%3A+Support+conversion+between+PyFlink+Table+and+Pandas+DataFrame -- Best Regards Jeff Zhang |
Hi Jeff,
Thanks for your feedback. ArrowTableSink is a Flink sink which is responsible for collecting the data of the table. It will serialize the data of the table to Arrow format to make sure that it could be deserialized to pandas dataframe efficiently. You are right that pandas dataframe is constructed at the client side and so there needs a way to transfer the table data from the ArrowTableSink to the client. It shares the same design as Table.collect on how to transfer the data to the client. This is still under lively discussion in FLINK-14807. I think we can discuss it there on this aspect and so it's not touched in this design(already mentioned in the design doc). Then we can focus on table/dataframe conversion in this design. Does that make sense to you? Thanks, Dian [1] https://issues.apache.org/jira/browse/FLINK-14807 <https://issues.apache.org/jira/browse/FLINK-14807> > 在 2020年4月1日,上午11:14,Jeff Zhang <[hidden email]> 写道: > > Thanks Dian for driving this, definitely +1 > > Here's my 2 cents: > > 1. I would pay more attention on to_pandas than from_pandas. Because > to_pandas will be used more frequently I believe > 2. I think ArrowTableSink may not be enough for to_pandas, because pandas > dataframe is on client side, it is not a table sink. We still need to > convert ArrowTableSink to pandas dataframe if I understand correctly. > > > > > Dian Fu <[hidden email]> 于2020年4月1日周三 上午10:49写道: > >> Hi everyone, >> >> I'd like to start a discussion about supporting conversion between PyFlink >> Table and Pandas DataFrame. >> >> Pandas dataframe is the de-facto standard to work with tabular data in >> Python community. PyFlink table is Flink’s representation of the tabular >> data in Python language. It would be nice to provide the functionality to >> convert between the PyFlink table and Pandas dataframe in PyFlink Table >> API. It provides users the ability to switch between PyFlink and Pandas >> seamlessly when processing data in Python language without an extra >> intermediate connectors. >> >> Jincheng Sun and I have discussed offline and have drafted the >> FLIP-120[1]. Looking forward to your feedback! >> >> Regards, >> Dian >> >> [1] >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-120%3A+Support+conversion+between+PyFlink+Table+and+Pandas+DataFrame > > > > -- > Best Regards > > Jeff Zhang |
Thanks for the reply, Dian, that make sense to me.
Dian Fu <[hidden email]> 于2020年4月1日周三 上午11:53写道: > Hi Jeff, > > Thanks for your feedback. > > ArrowTableSink is a Flink sink which is responsible for collecting the > data of the table. It will serialize the data of the table to Arrow format > to make sure that it could be deserialized to pandas dataframe efficiently. > You are right that pandas dataframe is constructed at the client side and > so there needs a way to transfer the table data from the ArrowTableSink to > the client. It shares the same design as Table.collect on how to transfer > the data to the client. This is still under lively discussion in > FLINK-14807. I think we can discuss it there on this aspect and so it's not > touched in this design(already mentioned in the design doc). Then we can > focus on table/dataframe conversion in this design. Does that make sense to > you? > > Thanks, > Dian > > [1] https://issues.apache.org/jira/browse/FLINK-14807 < > https://issues.apache.org/jira/browse/FLINK-14807> > > 在 2020年4月1日,上午11:14,Jeff Zhang <[hidden email]> 写道: > > > > Thanks Dian for driving this, definitely +1 > > > > Here's my 2 cents: > > > > 1. I would pay more attention on to_pandas than from_pandas. Because > > to_pandas will be used more frequently I believe > > 2. I think ArrowTableSink may not be enough for to_pandas, because pandas > > dataframe is on client side, it is not a table sink. We still need to > > convert ArrowTableSink to pandas dataframe if I understand correctly. > > > > > > > > > > Dian Fu <[hidden email]> 于2020年4月1日周三 上午10:49写道: > > > >> Hi everyone, > >> > >> I'd like to start a discussion about supporting conversion between > PyFlink > >> Table and Pandas DataFrame. > >> > >> Pandas dataframe is the de-facto standard to work with tabular data in > >> Python community. PyFlink table is Flink’s representation of the tabular > >> data in Python language. It would be nice to provide the functionality > to > >> convert between the PyFlink table and Pandas dataframe in PyFlink Table > >> API. It provides users the ability to switch between PyFlink and Pandas > >> seamlessly when processing data in Python language without an extra > >> intermediate connectors. > >> > >> Jincheng Sun and I have discussed offline and have drafted the > >> FLIP-120[1]. Looking forward to your feedback! > >> > >> Regards, > >> Dian > >> > >> [1] > >> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-120%3A+Support+conversion+between+PyFlink+Table+and+Pandas+DataFrame > > > > > > > > -- > > Best Regards > > > > Jeff Zhang > > -- Best Regards Jeff Zhang |
+1, Thanks for bring up this discussion @Dian Fu <[hidden email]>
Best, Jincheng Jeff Zhang <[hidden email]> 于2020年4月1日周三 下午1:27写道: > Thanks for the reply, Dian, that make sense to me. > > Dian Fu <[hidden email]> 于2020年4月1日周三 上午11:53写道: > > > Hi Jeff, > > > > Thanks for your feedback. > > > > ArrowTableSink is a Flink sink which is responsible for collecting the > > data of the table. It will serialize the data of the table to Arrow > format > > to make sure that it could be deserialized to pandas dataframe > efficiently. > > You are right that pandas dataframe is constructed at the client side and > > so there needs a way to transfer the table data from the ArrowTableSink > to > > the client. It shares the same design as Table.collect on how to transfer > > the data to the client. This is still under lively discussion in > > FLINK-14807. I think we can discuss it there on this aspect and so it's > not > > touched in this design(already mentioned in the design doc). Then we can > > focus on table/dataframe conversion in this design. Does that make sense > to > > you? > > > > Thanks, > > Dian > > > > [1] https://issues.apache.org/jira/browse/FLINK-14807 < > > https://issues.apache.org/jira/browse/FLINK-14807> > > > 在 2020年4月1日,上午11:14,Jeff Zhang <[hidden email]> 写道: > > > > > > Thanks Dian for driving this, definitely +1 > > > > > > Here's my 2 cents: > > > > > > 1. I would pay more attention on to_pandas than from_pandas. Because > > > to_pandas will be used more frequently I believe > > > 2. I think ArrowTableSink may not be enough for to_pandas, because > pandas > > > dataframe is on client side, it is not a table sink. We still need to > > > convert ArrowTableSink to pandas dataframe if I understand correctly. > > > > > > > > > > > > > > > Dian Fu <[hidden email]> 于2020年4月1日周三 上午10:49写道: > > > > > >> Hi everyone, > > >> > > >> I'd like to start a discussion about supporting conversion between > > PyFlink > > >> Table and Pandas DataFrame. > > >> > > >> Pandas dataframe is the de-facto standard to work with tabular data in > > >> Python community. PyFlink table is Flink’s representation of the > tabular > > >> data in Python language. It would be nice to provide the functionality > > to > > >> convert between the PyFlink table and Pandas dataframe in PyFlink > Table > > >> API. It provides users the ability to switch between PyFlink and > Pandas > > >> seamlessly when processing data in Python language without an extra > > >> intermediate connectors. > > >> > > >> Jincheng Sun and I have discussed offline and have drafted the > > >> FLIP-120[1]. Looking forward to your feedback! > > >> > > >> Regards, > > >> Dian > > >> > > >> [1] > > >> > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-120%3A+Support+conversion+between+PyFlink+Table+and+Pandas+DataFrame > > > > > > > > > > > > -- > > > Best Regards > > > > > > Jeff Zhang > > > > > > -- > Best Regards > > Jeff Zhang > |
Hi Dian,
Thanks for driving this. Big +1 for supporting from/to pandas in PyFlink! Best, Wei > 在 2020年4月3日,13:46,jincheng sun <[hidden email]> 写道: > > +1, Thanks for bring up this discussion @Dian Fu <[hidden email]> > > Best, > Jincheng > > > Jeff Zhang <[hidden email]> 于2020年4月1日周三 下午1:27写道: > >> Thanks for the reply, Dian, that make sense to me. >> >> Dian Fu <[hidden email]> 于2020年4月1日周三 上午11:53写道: >> >>> Hi Jeff, >>> >>> Thanks for your feedback. >>> >>> ArrowTableSink is a Flink sink which is responsible for collecting the >>> data of the table. It will serialize the data of the table to Arrow >> format >>> to make sure that it could be deserialized to pandas dataframe >> efficiently. >>> You are right that pandas dataframe is constructed at the client side and >>> so there needs a way to transfer the table data from the ArrowTableSink >> to >>> the client. It shares the same design as Table.collect on how to transfer >>> the data to the client. This is still under lively discussion in >>> FLINK-14807. I think we can discuss it there on this aspect and so it's >> not >>> touched in this design(already mentioned in the design doc). Then we can >>> focus on table/dataframe conversion in this design. Does that make sense >> to >>> you? >>> >>> Thanks, >>> Dian >>> >>> [1] https://issues.apache.org/jira/browse/FLINK-14807 < >>> https://issues.apache.org/jira/browse/FLINK-14807> >>>> 在 2020年4月1日,上午11:14,Jeff Zhang <[hidden email]> 写道: >>>> >>>> Thanks Dian for driving this, definitely +1 >>>> >>>> Here's my 2 cents: >>>> >>>> 1. I would pay more attention on to_pandas than from_pandas. Because >>>> to_pandas will be used more frequently I believe >>>> 2. I think ArrowTableSink may not be enough for to_pandas, because >> pandas >>>> dataframe is on client side, it is not a table sink. We still need to >>>> convert ArrowTableSink to pandas dataframe if I understand correctly. >>>> >>>> >>>> >>>> >>>> Dian Fu <[hidden email]> 于2020年4月1日周三 上午10:49写道: >>>> >>>>> Hi everyone, >>>>> >>>>> I'd like to start a discussion about supporting conversion between >>> PyFlink >>>>> Table and Pandas DataFrame. >>>>> >>>>> Pandas dataframe is the de-facto standard to work with tabular data in >>>>> Python community. PyFlink table is Flink’s representation of the >> tabular >>>>> data in Python language. It would be nice to provide the functionality >>> to >>>>> convert between the PyFlink table and Pandas dataframe in PyFlink >> Table >>>>> API. It provides users the ability to switch between PyFlink and >> Pandas >>>>> seamlessly when processing data in Python language without an extra >>>>> intermediate connectors. >>>>> >>>>> Jincheng Sun and I have discussed offline and have drafted the >>>>> FLIP-120[1]. Looking forward to your feedback! >>>>> >>>>> Regards, >>>>> Dian >>>>> >>>>> [1] >>>>> >>> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-120%3A+Support+conversion+between+PyFlink+Table+and+Pandas+DataFrame >>>> >>>> >>>> >>>> -- >>>> Best Regards >>>> >>>> Jeff Zhang >>> >>> >> >> -- >> Best Regards >> >> Jeff Zhang >> |
Thanks you all for the discussion. It seems that we have reached consensus on the design. I will start a VOTE thread if there are no other feedbacks.
Regards, Dian > 在 2020年4月3日,下午2:58,Wei Zhong <[hidden email]> 写道: > > Hi Dian, > > Thanks for driving this. Big +1 for supporting from/to pandas in PyFlink! > > Best, > Wei > >> 在 2020年4月3日,13:46,jincheng sun <[hidden email]> 写道: >> >> +1, Thanks for bring up this discussion @Dian Fu <[hidden email]> >> >> Best, >> Jincheng >> >> >> Jeff Zhang <[hidden email]> 于2020年4月1日周三 下午1:27写道: >> >>> Thanks for the reply, Dian, that make sense to me. >>> >>> Dian Fu <[hidden email]> 于2020年4月1日周三 上午11:53写道: >>> >>>> Hi Jeff, >>>> >>>> Thanks for your feedback. >>>> >>>> ArrowTableSink is a Flink sink which is responsible for collecting the >>>> data of the table. It will serialize the data of the table to Arrow >>> format >>>> to make sure that it could be deserialized to pandas dataframe >>> efficiently. >>>> You are right that pandas dataframe is constructed at the client side and >>>> so there needs a way to transfer the table data from the ArrowTableSink >>> to >>>> the client. It shares the same design as Table.collect on how to transfer >>>> the data to the client. This is still under lively discussion in >>>> FLINK-14807. I think we can discuss it there on this aspect and so it's >>> not >>>> touched in this design(already mentioned in the design doc). Then we can >>>> focus on table/dataframe conversion in this design. Does that make sense >>> to >>>> you? >>>> >>>> Thanks, >>>> Dian >>>> >>>> [1] https://issues.apache.org/jira/browse/FLINK-14807 < >>>> https://issues.apache.org/jira/browse/FLINK-14807> >>>>> 在 2020年4月1日,上午11:14,Jeff Zhang <[hidden email]> 写道: >>>>> >>>>> Thanks Dian for driving this, definitely +1 >>>>> >>>>> Here's my 2 cents: >>>>> >>>>> 1. I would pay more attention on to_pandas than from_pandas. Because >>>>> to_pandas will be used more frequently I believe >>>>> 2. I think ArrowTableSink may not be enough for to_pandas, because >>> pandas >>>>> dataframe is on client side, it is not a table sink. We still need to >>>>> convert ArrowTableSink to pandas dataframe if I understand correctly. >>>>> >>>>> >>>>> >>>>> >>>>> Dian Fu <[hidden email]> 于2020年4月1日周三 上午10:49写道: >>>>> >>>>>> Hi everyone, >>>>>> >>>>>> I'd like to start a discussion about supporting conversion between >>>> PyFlink >>>>>> Table and Pandas DataFrame. >>>>>> >>>>>> Pandas dataframe is the de-facto standard to work with tabular data in >>>>>> Python community. PyFlink table is Flink’s representation of the >>> tabular >>>>>> data in Python language. It would be nice to provide the functionality >>>> to >>>>>> convert between the PyFlink table and Pandas dataframe in PyFlink >>> Table >>>>>> API. It provides users the ability to switch between PyFlink and >>> Pandas >>>>>> seamlessly when processing data in Python language without an extra >>>>>> intermediate connectors. >>>>>> >>>>>> Jincheng Sun and I have discussed offline and have drafted the >>>>>> FLIP-120[1]. Looking forward to your feedback! >>>>>> >>>>>> Regards, >>>>>> Dian >>>>>> >>>>>> [1] >>>>>> >>>> >>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-120%3A+Support+conversion+between+PyFlink+Table+and+Pandas+DataFrame >>>>> >>>>> >>>>> >>>>> -- >>>>> Best Regards >>>>> >>>>> Jeff Zhang >>>> >>>> >>> >>> -- >>> Best Regards >>> >>> Jeff Zhang >>> > |
Free forum by Nabble | Edit this page |