Re: Add Bucket File System Table Sink

Re: Add Bucket File System Table Sink

Kurt Young
Hi Jun,

Thanks for bringing this up. In general I'm +1 on this feature. As
you might know, there is another ongoing effort on this kind of
table sink, which is covered in the newly proposed rework of
partition support [1]. In that proposal, we also want to introduce a
new file system connector, which covers not only partition support
but also end-to-end exactly-once semantics in streaming mode.

I would suggest that we combine these two efforts into one. The
benefits would be saving some review effort and reducing the number
of core connectors, which eases our maintenance effort in the future.
What do you think?

BTW, BucketingSink is already deprecated; I think we should refer
to StreamingFileSink instead.
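
For readers new to the term, the "bucket" idea behind these sinks can be
illustrated without any Flink dependency. The following is a minimal
plain-Java sketch, not Flink's API: the class BucketPathSketch and its
methods are hypothetical names, and the hourly "yyyy-MM-dd--HH" pattern
is an assumption modelled on a time-based bucket assigner.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/** Hypothetical illustration only; this class is not part of Flink. */
public class BucketPathSketch {

    // Assumed hourly bucket pattern, evaluated in UTC so results are deterministic.
    private static final DateTimeFormatter HOURLY =
            DateTimeFormatter.ofPattern("yyyy-MM-dd--HH").withZone(ZoneOffset.UTC);

    /** Maps a record timestamp (epoch milliseconds) to a bucket id. */
    public static String bucketId(long epochMillis) {
        return HOURLY.format(Instant.ofEpochMilli(epochMillis));
    }

    /** Joins a base path and a bucket id into the directory a record would go to. */
    public static String bucketPath(String basePath, long epochMillis) {
        String base = basePath.endsWith("/") ? basePath : basePath + "/";
        return base + bucketId(epochMillis);
    }

    public static void main(String[] args) {
        // Example timestamp and base path are made up for the demonstration.
        long ts = Instant.parse("2019-09-17T10:39:00Z").toEpochMilli();
        System.out.println(bucketPath("hdfs:///tmp/sink-output", ts));
    }
}
```

Records whose timestamps fall in the same hour map to the same directory,
which is why a single sink can keep writing a continuous stream while the
output stays organized into separate folders.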

Best,
Kurt

[1]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-63-Rework-table-partition-support-td32770.html


On Tue, Sep 17, 2019 at 10:39 AM Jun Zhang <[hidden email]> wrote:

> Hello everyone,
> I am a user and fan of Flink, and I would like to join the Flink community. I
> contributed my first PR a few days ago. Could anyone help review my
> code? If there is something wrong, I would be grateful for any advice.
>
> This PR came out of my development work: I use SQL to read data
> from Kafka and write it to HDFS, and I found that there is no suitable
> table sink. I checked the documentation and found that the File System
> Connector is only experimental (
> https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/table/connect.html#file-system-connector),
> so I wrote a Bucket File System Table Sink that supports writing streaming
> data to HDFS and the local file system. The supported data formats are JSON,
> CSV, Parquet, and Avro. Support for other formats, such as Protobuf and
> Thrift, can be added later.
>
> In addition, I added documentation, Python API support, unit tests,
> an end-to-end test, SQL Client and DDL support, and the build passes on Travis.
>
> The issue is https://issues.apache.org/jira/browse/FLINK-12584
> Thank you very much.
>
>
>

Re: Add Bucket File System Table Sink

Kurt Young
Thanks. Let me clarify my thinking a bit more. Generally, I would
prefer that we concentrate connector functionality, especially for
standard and popular connectors such as Kafka and the file system
with its different formats. We should make these core connectors
as powerful as we can, which also prevents bad situations such as
"if you want to use this feature, please use connectorA, but if you
want to use another feature, please use connectorB".

Best,
Kurt


On Tue, Sep 17, 2019 at 11:11 AM Jun Zhang <[hidden email]> wrote:

> Hi Kurt,
> Thank you very much.
> I will take a closer look at FLIP-63.
>
> The sink in this PR is built on StreamingFileSink, not
> BucketingSink; I only gave it the name "Bucket".
>
>
> On 09/17/2019 10:57,Kurt Young<[hidden email]> <[hidden email]>
> wrote:

Re: Add Bucket File System Table Sink

zhangjun
Hi Kurt,

Thanks.
When I ran into this problem, I did find the existing File System Connector, but it is not powerful or feature-rich enough.
I also found that it is built into Flink and that many unit tests refer to it, so I did not dare to modify it casually to extend its functionality.

So I developed a new connector. Later we can keep only one File System Connector and make sure that it is powerful and stable.

I will study FLIP-63 and see whether there is a better way to combine the two efforts. I am very willing to join this development.

------------------ Original ------------------
From: "Kurt Young" <[hidden email]>
Date: Tue, Sep 17, 2019 11:19 AM
To: "Jun Zhang" <[hidden email]>
Cc: "dev" <[hidden email]>, "user" <[hidden email]>
Subject: Re: Add Bucket File System Table Sink




Re: Add Bucket File System Table Sink

Kurt Young
Great to hear.

Best,
Kurt


On Tue, Sep 17, 2019 at 11:45 AM Jun Zhang <[hidden email]> wrote:


Re: Add Bucket File System Table Sink

Fabian Hueske-2
Hi Jun,

Thank you very much for your contribution.

I think a Bucketing File System Table Sink would be a great addition.

Our code contribution guidelines [1] recommend discussing the design with
the community before opening a PR.
First of all, this ensures that the design is aligned with Flink's codebase
and planned features.
Moreover, it helps to find a committer who can shepherd the PR.

It is always a good idea to split a contribution into multiple
smaller PRs (if possible).
This allows for faster review and progress.

Best, Fabian

[1] https://flink.apache.org/contributing/contribute-code.html

On Tue, Sep 17, 2019 at 04:39, Jun Zhang <[hidden email]> wrote:


Re: Add Bucket File System Table Sink

zhangjun
Hi Fabian,

Thank you very much for your suggestion. This came up when I was using Flink SQL to write data to HDFS at work and found it inconvenient. I wrote this feature and then wanted to contribute it to the community. This is my first PR, so I may not be clear about some of the process; I am very sorry.

Kurt suggested combining this feature with FLIP-63 because they have some features in common, such as writing data to a file system in various formats. So I would like to treat this work as a sub-task of FLIP-63: adding a partitionable bucket file system table sink.

I will then add documentation and send a DISCUSS thread to explain my detailed design and implementation. What do you think?

------------------ Original ------------------
From: Fabian Hueske <[hidden email]>
Date: Fri, Sep 20, 2019 9:38 PM
To: Jun Zhang <[hidden email]>
Cc: dev <[hidden email]>, user <[hidden email]>
Subject: Re: Add Bucket File System Table Sink




Re: Add Bucket File System Table Sink

Kurt Young
Hi Jun,

Thanks for your understanding. If we all agree that adding this functionality
to FLIP-63 is a good idea, I would suggest that you also help review the
FLIP-63 design document to see whether the current design meets your
requirements. You can also leave comments on the design document if you
have any thoughts.

Best,
Kurt


On Fri, Sep 20, 2019 at 10:44 PM Jun Zhang <[hidden email]> wrote:


Re: Add Bucket File System Table Sink

zhangjun
In reply to this post by Kurt Young
Hi Kurt,

Thank you very much for your suggestion.
I have studied the FLIP-63 design document and have no comments on it.
Following your suggestion from a few days ago, I am writing a design document for the Bucket File System Table Sink and considering how to support partitions. I will send a DISCUSS thread later, and I hope you can help review it. Thank you.

Best,
Jun

------------------ Original ------------------
From: Kurt Young <[hidden email]>
Date: Sat, Sep 21, 2019 11:06 AM
To: Jun Zhang <[hidden email]>
Cc: dev <[hidden email]>
Subject: Re: Add Bucket File System Table Sink


