Hello,
I am looking to contribute a ProtoParquetWriter support which can be used in Bulk format for the StreamingFileSink api. There has been earlier discussions on this in the user mailing list: https://goo.gl/ya2StL and thought it would be a good addition to have. For implementation, looking at the current APIs present at ProtoParquetWriter with the parguet project (http://tinyurl.com/y378be42), it looks like there is some different in the interface between Avro and Proto writes (ProtoParquetWriter does not have a builder class as well as not interface with Outputfile). Due to this, I was looking at directly extending the ParquetWriter within Flink to define the Builder static class and have newer interfaces. This is needed as the bulk writer takes a builder to crate the ParquetWriter in the bulkWriter.Factory. ( http://tinyurl.com/yyg9cn9b) Any thoughts if this is a reasonable approach? Thanks Kailash |
cc'ing few folks who are interested in this discussion.
On Tue, Apr 9, 2019 at 11:35 AM Kailash Dayanand <[hidden email]> wrote: > Hello, > > I am looking to contribute a ProtoParquetWriter support which can be used > in Bulk format for the StreamingFileSink api. There has been earlier > discussions on this in the user mailing list: https://goo.gl/ya2StL and > thought it would be a good addition to have. > > For implementation, looking at the current APIs present at > ProtoParquetWriter with the parguet project (http://tinyurl.com/y378be42), > it looks like there is some different in the interface between Avro and > Proto writes (ProtoParquetWriter does not have a builder class as well as > not interface with Outputfile). Due to this, I was looking at directly > extending the ParquetWriter within Flink to define the Builder static class > and have newer interfaces. This is needed as the bulk writer takes a > builder to crate the ParquetWriter in the bulkWriter.Factory. ( > http://tinyurl.com/yyg9cn9b) > > Any thoughts if this is a reasonable approach? > > Thanks > Kailash > |
Friendly remainder. Any thoughts on this approach ?
On Tue, Apr 9, 2019 at 11:36 AM Kailash Dayanand <[hidden email]> wrote: > cc'ing few folks who are interested in this discussion. > > On Tue, Apr 9, 2019 at 11:35 AM Kailash Dayanand <[hidden email]> > wrote: > >> Hello, >> >> I am looking to contribute a ProtoParquetWriter support which can be used >> in Bulk format for the StreamingFileSink api. There has been earlier >> discussions on this in the user mailing list: https://goo.gl/ya2StL and >> thought it would be a good addition to have. >> >> For implementation, looking at the current APIs present at >> ProtoParquetWriter with the parguet project (http://tinyurl.com/y378be42), >> it looks like there is some different in the interface between Avro and >> Proto writes (ProtoParquetWriter does not have a builder class as well as >> not interface with Outputfile). Due to this, I was looking at directly >> extending the ParquetWriter within Flink to define the Builder static class >> and have newer interfaces. This is needed as the bulk writer takes a >> builder to crate the ParquetWriter in the bulkWriter.Factory. ( >> http://tinyurl.com/yyg9cn9b) >> >> Any thoughts if this is a reasonable approach? >> >> Thanks >> Kailash >> > |
+1. Sounds good to me.
-Jakob On Mon, Apr 22, 2019 at 9:00 AM Kailash Dayanand <[hidden email]> wrote: > > Friendly remainder. Any thoughts on this approach ? > > On Tue, Apr 9, 2019 at 11:36 AM Kailash Dayanand <[hidden email]> wrote: > > > cc'ing few folks who are interested in this discussion. > > > > On Tue, Apr 9, 2019 at 11:35 AM Kailash Dayanand <[hidden email]> > > wrote: > > > >> Hello, > >> > >> I am looking to contribute a ProtoParquetWriter support which can be used > >> in Bulk format for the StreamingFileSink api. There has been earlier > >> discussions on this in the user mailing list: https://goo.gl/ya2StL and > >> thought it would be a good addition to have. > >> > >> For implementation, looking at the current APIs present at > >> ProtoParquetWriter with the parguet project (http://tinyurl.com/y378be42), > >> it looks like there is some different in the interface between Avro and > >> Proto writes (ProtoParquetWriter does not have a builder class as well as > >> not interface with Outputfile). Due to this, I was looking at directly > >> extending the ParquetWriter within Flink to define the Builder static class > >> and have newer interfaces. This is needed as the bulk writer takes a > >> builder to crate the ParquetWriter in the bulkWriter.Factory. ( > >> http://tinyurl.com/yyg9cn9b) > >> > >> Any thoughts if this is a reasonable approach? > >> > >> Thanks > >> Kailash > >> > > |
Thanks Kailash for bringing this up. I think this is a good idea. By
passing the ParquetWriter we gain much more flexibility. I did a small PR on adding the ability to add compression to the Parquet writer: https://github.com/apache/flink/pull/7547 But I believe this is the wrong approach. For example, I'd like to tune the Page sizes as well, but this requires another PR and a lot of code changes, to simply set this due to the rather complex structure of the builder pattern. Personally, I would prefer it to just pass a generic ParquetWriter to the BulkWriter. This makes it much easier, and I don't see the added value of having the constructor here. Cheers, Fokko Op ma 22 apr. 2019 om 19:53 schreef Jakob Homan <[hidden email]>: > +1. Sounds good to me. > -Jakob > > On Mon, Apr 22, 2019 at 9:00 AM Kailash Dayanand > <[hidden email]> wrote: > > > > Friendly remainder. Any thoughts on this approach ? > > > > On Tue, Apr 9, 2019 at 11:36 AM Kailash Dayanand <[hidden email]> > wrote: > > > > > cc'ing few folks who are interested in this discussion. > > > > > > On Tue, Apr 9, 2019 at 11:35 AM Kailash Dayanand <[hidden email]> > > > wrote: > > > > > >> Hello, > > >> > > >> I am looking to contribute a ProtoParquetWriter support which can be > used > > >> in Bulk format for the StreamingFileSink api. There has been earlier > > >> discussions on this in the user mailing list: https://goo.gl/ya2StL > and > > >> thought it would be a good addition to have. > > >> > > >> For implementation, looking at the current APIs present at > > >> ProtoParquetWriter with the parguet project ( > http://tinyurl.com/y378be42), > > >> it looks like there is some different in the interface between Avro > and > > >> Proto writes (ProtoParquetWriter does not have a builder class as > well as > > >> not interface with Outputfile). Due to this, I was looking at directly > > >> extending the ParquetWriter within Flink to define the Builder static > class > > >> and have newer interfaces. This is needed as the bulk writer takes a > > >> builder to crate the ParquetWriter in the bulkWriter.Factory. ( > > >> http://tinyurl.com/yyg9cn9b) > > >> > > >> Any thoughts if this is a reasonable approach? > > >> > > >> Thanks > > >> Kailash > > >> > > > > |
Free forum by Nabble | Edit this page |