Support for ProtoBuf data format in StreamingFileSink

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Support for ProtoBuf data format in StreamingFileSink

Kailash Dayanand-2
Hello,

I am looking to contribute a ProtoParquetWriter support which can be used
in Bulk format for the StreamingFileSink api. There has been earlier
discussions on this in the user mailing list: https://goo.gl/ya2StL and
thought it would be a good addition to have.

For implementation, looking at the current APIs present at
ProtoParquetWriter with the parguet project (http://tinyurl.com/y378be42),
it looks like there is some different in the interface between Avro and
Proto writes (ProtoParquetWriter does not have a builder class as well as
not interface with Outputfile). Due to this, I was looking at directly
extending the ParquetWriter within Flink to define the Builder static class
and have newer interfaces. This is needed as the bulk writer takes a
builder to crate the ParquetWriter in the bulkWriter.Factory. (
http://tinyurl.com/yyg9cn9b)

Any thoughts if this is a reasonable approach?

Thanks
Kailash
Reply | Threaded
Open this post in threaded view
|

Re: Support for ProtoBuf data format in StreamingFileSink

Kailash Dayanand-2
cc'ing few folks who are interested in this discussion.

On Tue, Apr 9, 2019 at 11:35 AM Kailash Dayanand <[hidden email]> wrote:

> Hello,
>
> I am looking to contribute a ProtoParquetWriter support which can be used
> in Bulk format for the StreamingFileSink api. There has been earlier
> discussions on this in the user mailing list: https://goo.gl/ya2StL and
> thought it would be a good addition to have.
>
> For implementation, looking at the current APIs present at
> ProtoParquetWriter with the parguet project (http://tinyurl.com/y378be42),
> it looks like there is some different in the interface between Avro and
> Proto writes (ProtoParquetWriter does not have a builder class as well as
> not interface with Outputfile). Due to this, I was looking at directly
> extending the ParquetWriter within Flink to define the Builder static class
> and have newer interfaces. This is needed as the bulk writer takes a
> builder to crate the ParquetWriter in the bulkWriter.Factory. (
> http://tinyurl.com/yyg9cn9b)
>
> Any thoughts if this is a reasonable approach?
>
> Thanks
> Kailash
>
Reply | Threaded
Open this post in threaded view
|

Re: Support for ProtoBuf data format in StreamingFileSink

Kailash Dayanand-2
Friendly remainder. Any thoughts on this approach ?

On Tue, Apr 9, 2019 at 11:36 AM Kailash Dayanand <[hidden email]> wrote:

> cc'ing few folks who are interested in this discussion.
>
> On Tue, Apr 9, 2019 at 11:35 AM Kailash Dayanand <[hidden email]>
> wrote:
>
>> Hello,
>>
>> I am looking to contribute a ProtoParquetWriter support which can be used
>> in Bulk format for the StreamingFileSink api. There has been earlier
>> discussions on this in the user mailing list: https://goo.gl/ya2StL and
>> thought it would be a good addition to have.
>>
>> For implementation, looking at the current APIs present at
>> ProtoParquetWriter with the parguet project (http://tinyurl.com/y378be42),
>> it looks like there is some different in the interface between Avro and
>> Proto writes (ProtoParquetWriter does not have a builder class as well as
>> not interface with Outputfile). Due to this, I was looking at directly
>> extending the ParquetWriter within Flink to define the Builder static class
>> and have newer interfaces. This is needed as the bulk writer takes a
>> builder to crate the ParquetWriter in the bulkWriter.Factory. (
>> http://tinyurl.com/yyg9cn9b)
>>
>> Any thoughts if this is a reasonable approach?
>>
>> Thanks
>> Kailash
>>
>
Reply | Threaded
Open this post in threaded view
|

Re: Support for ProtoBuf data format in StreamingFileSink

Jakob Homan
+1.  Sounds good to me.
-Jakob

On Mon, Apr 22, 2019 at 9:00 AM Kailash Dayanand
<[hidden email]> wrote:

>
> Friendly remainder. Any thoughts on this approach ?
>
> On Tue, Apr 9, 2019 at 11:36 AM Kailash Dayanand <[hidden email]> wrote:
>
> > cc'ing few folks who are interested in this discussion.
> >
> > On Tue, Apr 9, 2019 at 11:35 AM Kailash Dayanand <[hidden email]>
> > wrote:
> >
> >> Hello,
> >>
> >> I am looking to contribute a ProtoParquetWriter support which can be used
> >> in Bulk format for the StreamingFileSink api. There has been earlier
> >> discussions on this in the user mailing list: https://goo.gl/ya2StL and
> >> thought it would be a good addition to have.
> >>
> >> For implementation, looking at the current APIs present at
> >> ProtoParquetWriter with the parguet project (http://tinyurl.com/y378be42),
> >> it looks like there is some different in the interface between Avro and
> >> Proto writes (ProtoParquetWriter does not have a builder class as well as
> >> not interface with Outputfile). Due to this, I was looking at directly
> >> extending the ParquetWriter within Flink to define the Builder static class
> >> and have newer interfaces. This is needed as the bulk writer takes a
> >> builder to crate the ParquetWriter in the bulkWriter.Factory. (
> >> http://tinyurl.com/yyg9cn9b)
> >>
> >> Any thoughts if this is a reasonable approach?
> >>
> >> Thanks
> >> Kailash
> >>
> >
Reply | Threaded
Open this post in threaded view
|

Re: Support for ProtoBuf data format in StreamingFileSink

Driesprong, Fokko
Thanks Kailash for bringing this up. I think this is a good idea. By
passing the ParquetWriter we gain much more flexibility.

I did a small PR on adding the ability to add compression to the Parquet
writer: https://github.com/apache/flink/pull/7547 But I believe this is the
wrong approach. For example, I'd like to tune the Page sizes as well, but
this requires another PR and a lot of code changes, to simply set this due
to the rather complex structure of the builder pattern. Personally, I would
prefer it to just pass a generic ParquetWriter to the BulkWriter. This
makes it much easier, and I don't see the added value of having the
constructor here.

Cheers, Fokko

Op ma 22 apr. 2019 om 19:53 schreef Jakob Homan <[hidden email]>:

> +1.  Sounds good to me.
> -Jakob
>
> On Mon, Apr 22, 2019 at 9:00 AM Kailash Dayanand
> <[hidden email]> wrote:
> >
> > Friendly remainder. Any thoughts on this approach ?
> >
> > On Tue, Apr 9, 2019 at 11:36 AM Kailash Dayanand <[hidden email]>
> wrote:
> >
> > > cc'ing few folks who are interested in this discussion.
> > >
> > > On Tue, Apr 9, 2019 at 11:35 AM Kailash Dayanand <[hidden email]>
> > > wrote:
> > >
> > >> Hello,
> > >>
> > >> I am looking to contribute a ProtoParquetWriter support which can be
> used
> > >> in Bulk format for the StreamingFileSink api. There has been earlier
> > >> discussions on this in the user mailing list: https://goo.gl/ya2StL
> and
> > >> thought it would be a good addition to have.
> > >>
> > >> For implementation, looking at the current APIs present at
> > >> ProtoParquetWriter with the parguet project (
> http://tinyurl.com/y378be42),
> > >> it looks like there is some different in the interface between Avro
> and
> > >> Proto writes (ProtoParquetWriter does not have a builder class as
> well as
> > >> not interface with Outputfile). Due to this, I was looking at directly
> > >> extending the ParquetWriter within Flink to define the Builder static
> class
> > >> and have newer interfaces. This is needed as the bulk writer takes a
> > >> builder to crate the ParquetWriter in the bulkWriter.Factory. (
> > >> http://tinyurl.com/yyg9cn9b)
> > >>
> > >> Any thoughts if this is a reasonable approach?
> > >>
> > >> Thanks
> > >> Kailash
> > >>
> > >
>