Hi all,

I am moving some code to use the StreamingFileSink. Currently, there doesn't appear to be any native support for compression (gzip or otherwise) built into Flink when using the StreamingFileSink. This seems like a really common need that, as far as I could tell, isn't represented in Jira.

After a fair amount of digging, it seems the way to do this is to implement the BulkWriter interface, where you can trivially wrap the output stream with something like a GZIPOutputStream.

Until compression is built into the StreamingFileSink, it might make sense to add some docs on how to use compression with it.

I am willing to spend a bit of time documenting that, but before I do, I wanted to make sure that this is in fact the correct way to think about the problem and to get your thoughts.

Thanks!
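For readers following along, the idea described above (a BulkWriter that wraps the output stream in a GZIPOutputStream) might look roughly like the sketch below. It is only an illustration under stated assumptions: the nested BulkWriter interface is a local stand-in that mirrors the three methods of Flink's org.apache.flink.api.common.serialization.BulkWriter so the example runs without Flink on the classpath, and the names GzipBulkWriterSketch, GzipStringWriter and gunzip are hypothetical. In a real job you would implement Flink's own interface and hand the sink a BulkWriter.Factory instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipBulkWriterSketch {

    /** Local stand-in (assumption) mirroring the methods of Flink's BulkWriter<T>. */
    interface BulkWriter<T> {
        void addElement(T element) throws IOException;
        void flush() throws IOException;
        void finish() throws IOException;
    }

    /** Hypothetical writer: emits each record as one UTF-8 line into a gzip stream. */
    static final class GzipStringWriter implements BulkWriter<String> {
        private final GZIPOutputStream gzip;

        GzipStringWriter(OutputStream out) throws IOException {
            // syncFlush = true so that flush() pushes pending compressed bytes downstream.
            this.gzip = new GZIPOutputStream(out, true);
        }

        @Override
        public void addElement(String element) throws IOException {
            gzip.write(element.getBytes(StandardCharsets.UTF_8));
            gzip.write('\n');
        }

        @Override
        public void flush() throws IOException {
            gzip.flush();
        }

        @Override
        public void finish() throws IOException {
            // Write the gzip trailer but do NOT close the underlying stream;
            // the caller that owns the stream is expected to close it.
            gzip.finish();
        }
    }

    /** Demo helper: decompresses gzip bytes back into text. */
    static String gunzip(byte[] compressed) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream text = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                text.write(buffer, 0, n);
            }
            return new String(text.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // The byte array plays the role of one part file.
        ByteArrayOutputStream partFile = new ByteArrayOutputStream();
        BulkWriter<String> writer = new GzipStringWriter(partFile);
        writer.addElement("first record");
        writer.addElement("second record");
        writer.finish();
        System.out.print(gunzip(partFile.toByteArray()));
    }
}
```

The one design choice worth noting is that finish() calls GZIPOutputStream.finish() rather than close(): as far as I understand the BulkWriter contract, the writer must complete its encoding but leave closing the stream to the sink.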
Just noticed one detail about using the BulkWriter interface: you can no longer assign a rolling policy. That makes sense for formats like ORC/Parquet, but perhaps not for simple text compression.

On Wed, Nov 14, 2018 at 1:43 PM Addison Higham <[hidden email]> wrote:
> [...]
Hi Addison,
I think it is a good idea to add some more details to the documentation. Thus, it would be great if you could contribute a description of how to enable compression.

Concerning the RollingPolicy, I've pulled in Klou, who might give you more details about the design decisions.

Cheers,
Till

On Wed, Nov 14, 2018 at 10:07 PM Addison Higham <[hidden email]> wrote:
> [...]
Hi Addison,
Sorry for the late reply. I agree that the documentation can be significantly improved and that adding compression could be a nice thing to have.

There is already a PR open for supporting writing SequenceFiles with the StreamingFileSink (https://github.com/apache/flink/pull/6774). When this gets merged, you will be able to use compression when writing SequenceFiles.

If this is not enough and you want to write plain text and compress it when you finalise your part file, then you are right that you will need to write your own BulkWriter. As you said, BulkWriters have only one RollingPolicy: they roll on every checkpoint. There are plans to alleviate this limitation in the future.

Cheers,
Kostas

On Thu, Nov 15, 2018 at 10:25 AM Till Rohrmann <[hidden email]> wrote:
> [...]