Compressing files with the Bucketing Sink

Compressing files with the Bucketing Sink

lrao@lyft.com
I want to upload a compressed file (preferably gzip) using the Bucketing Sink. What is the best way to do this? Would I have to implement my own Writer that does the compression? Has anyone done something similar?

Re: Compressing files with the Bucketing Sink

Till Rohrmann
Hi,

the SequenceFileWriter and the AvroKeyValueSinkWriter both support
compressed outputs. Apart from that, I'm not aware of any other Writers
which support compression. Maybe you could use these two Writers as a
guiding example. Alternatively, you could try extending the
StreamWriterBase and wrapping the outStream in a GZIPOutputStream.

Cheers,
Till

On Wed, Mar 28, 2018 at 1:59 AM, [hidden email] <[hidden email]> wrote:

> I want to upload a compressed file (gzip preferably) using the Bucketing
> Sink. What is the best way to do this? Would I have to implement my own
> Writer that does the compression? Has anyone done something similar?
>

Re: Compressing files with the Bucketing Sink

lrao@lyft.com

Thanks a lot for the suggestion, Till!

I ended up using your suggestion of extending StreamWriterBase and wrapping the FSDataOutputStream with GZIPOutputStream.
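
For anyone finding this thread later, here is a minimal sketch of the wrapping idea, not the exact code from my job. To keep it runnable without Flink on the classpath, it wraps a plain java.io.OutputStream; in the real writer that stream would be the FSDataOutputStream opened by StreamWriterBase, and the open/write/flush/close methods would be the overrides in the subclass. The class name GzipLineWriter and the newline-per-record format are assumptions made for illustration only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper (not a Flink class): writes one record per line through gzip. */
class GzipLineWriter implements AutoCloseable {
    private GZIPOutputStream gzip;

    /** Wrap the raw stream. In a Flink writer this would be the stream opened by the base class. */
    void open(OutputStream raw) throws IOException {
        // syncFlush = true: flush() pushes pending compressed bytes down to the raw stream.
        gzip = new GZIPOutputStream(raw, true);
    }

    /** Compress one record, terminated by a newline (an assumed record format). */
    void write(String record) throws IOException {
        gzip.write(record.getBytes(StandardCharsets.UTF_8));
        gzip.write('\n');
    }

    void flush() throws IOException {
        gzip.flush();
    }

    /** Closing the gzip stream writes the gzip trailer and then closes the raw stream. */
    @Override
    public void close() throws IOException {
        if (gzip != null) {
            gzip.close();
            gzip = null;
        }
    }

    /** Test helper: decompress a complete gzip byte array back into a UTF-8 string. */
    static String gunzip(byte[] compressed) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // An in-memory sink stands in for the file stream of a bucket.
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GzipLineWriter writer = new GzipLineWriter()) {
            writer.open(sink);
            writer.write("first record");
            writer.write("second record");
        }
        System.out.print(gunzip(sink.toByteArray()));
    }
}
```

The part that matters is close(): the gzip trailer is only written when the GZIPOutputStream itself is closed or finished, so the wrapper has to be closed rather than just the underlying stream.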


On 2018/03/28 09:44:26, Till Rohrmann <[hidden email]> wrote:

> Hi,
>
> the SequenceFileWriter and the AvroKeyValueSinkWriter both support
> compressed outputs. Apart from that, I'm not aware of any other Writers
> which support compression. Maybe you could use these two Writers as a
> guiding example. Alternatively, you could try extending the
> StreamWriterBase and wrapping the outStream in a GZIPOutputStream.
>
> Cheers,
> Till
>
> On Wed, Mar 28, 2018 at 1:59 AM, [hidden email] <[hidden email]> wrote:
>
> > I want to upload a compressed file (gzip preferably) using the Bucketing
> > Sink. What is the best way to do this? Would I have to implement my own
> > Writer that does the compression? Has anyone done something similar?
> >
>