Hi everybody,
Does Flink do any kind of compression of the data that is sent across the network?

Best,
Viktor
Hi,
currently, we are not compressing the network data. However, it should be easy to add support for that. Ufuk can probably elaborate on the details, but he told me some time ago that we could add code to compress outgoing network buffers using Snappy or other fast compression algorithms. I/O-intensive applications in particular would benefit from such a change.

Robert
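To make the idea concrete: compressing an outgoing network buffer is a small, self-contained step. The sketch below uses the JDK's built-in `Deflater` purely as a stand-in for Snappy/LZ4 (which are external libraries); the class and method names are illustrative, not Flink APIs.

```java
import java.util.zip.Deflater;

public class BufferCompressor {
    // Hypothetical helper: compress an outgoing network buffer before it is
    // handed to the transport layer. Returns only the compressed bytes.
    public static byte[] compress(byte[] buffer, int length) {
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        deflater.setInput(buffer, 0, length);
        deflater.finish();
        // Incompressible input can expand slightly, so reserve some headroom.
        byte[] out = new byte[length + 64];
        int written = deflater.deflate(out);
        deflater.end();
        byte[] result = new byte[written];
        System.arraycopy(out, 0, result, 0, written);
        return result;
    }

    public static void main(String[] args) {
        byte[] repetitive = new byte[4096]; // zero-filled, highly compressible
        byte[] compressed = compress(repetitive, repetitive.length);
        System.out.println(compressed.length < repetitive.length);
    }
}
```

A real integration would swap `Deflater` for a fast codec and hook this in where buffers leave the task manager; the receiving side needs the matching decompression step.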
I agree and think that compressing shuffle data is generally a good idea (given codecs like Snappy et al.).

I've opened a corresponding issue (https://issues.apache.org/jira/browse/FLINK-1275). The network layer is going through a set of big changes at the moment. I think it makes sense to wait for them to be finished/merged before working on this.

@Viktor: If you want to work on this, feel free to comment on the JIRA issue and we can discuss at which points you would have to extend the system.

– Ufuk
Will the compression codec be inserted in the Netty loops, or before that?

In any case, would it make sense to prototype this on the current code and forward-port it to the new network stack later? I assume the code would mostly be similar, especially all the JNI vs. Java considerations and tricks.
What reduction in network traffic can we expect from using compression?

-s
Always depends on the data; we need to measure that. From what I have heard in Hadoop, anywhere between no reduction and a factor of 3-4 may happen, especially if the records contain repetitive strings...
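The data-dependence is easy to demonstrate. The small experiment below compresses repetitive versus random bytes with the JDK's `Deflater` (again only a stand-in for Snappy/LZ4; the numbers for those codecs will differ, but the shape of the result is the same).

```java
import java.util.Random;
import java.util.zip.Deflater;

public class RatioDemo {
    // Compress `input` and return the compressed size in bytes.
    static int compressedSize(byte[] input) {
        Deflater d = new Deflater(Deflater.BEST_SPEED);
        d.setInput(input);
        d.finish();
        byte[] out = new byte[input.length + 64]; // headroom for incompressible input
        int n = d.deflate(out);
        d.end();
        return n;
    }

    public static void main(String[] args) {
        int size = 64 * 1024;
        // Records with repetitive strings, as in many real data sets.
        StringBuilder sb = new StringBuilder();
        while (sb.length() < size) sb.append("HamburgBerlinHamburgBerlin");
        byte[] repetitive = sb.toString().getBytes();
        // Already-compressed or encrypted data looks like this.
        byte[] random = new byte[size];
        new Random(42).nextBytes(random);

        // Repetitive data shrinks by a large factor...
        System.out.println((double) repetitive.length / compressedSize(repetitive));
        // ...while random data hardly shrinks at all.
        System.out.println((double) random.length / compressedSize(random));
    }
}
```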
On Mon, Nov 24, 2014 at 10:20 AM, Stephan Ewen <[hidden email]> wrote:
> Will the compression codec be inserted in the Netty loops, or before that?

In the current master, I would say that it makes sense to do it in the Netty loops during shuffling. The compression would then be totally transparent to the writers/readers. With the new network stack we will have to think this through a little bit, as we have more options and scenarios due to the fact that results will possibly be consumed multiple times.

> In any case, would it make sense to prototype this on the current code and forward-port this to the new network stack later?

Yes. It makes sense to have this in any case.
Hi,
A codec like Snappy would work on an entire network buffer as one big blob, right? I was more thinking along the lines of compressing individual tuple fields by treating them as columns, e.g., using frame-of-reference encoding and bit packing. Compression on tuple fields should yield much better results than compressing the entire blob. Given that Flink controls the serialization process, this should be transparent to other layers in the code. Not sure it is worth the effort, though.

Cheers,
Viktor
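For readers unfamiliar with frame-of-reference encoding: it stores a column's minimum once and keeps only the small offsets from it, which then need far fewer bits each. A simplified sketch, leaving out the final bit-packing step (names are illustrative, not from any Flink serializer):

```java
import java.util.Arrays;

public class FrameOfReference {
    // Encode a column of ints: slot 0 holds the reference (the minimum),
    // the rest hold non-negative deltas from it. A real implementation
    // would bit-pack the deltas; values like 100000..100003 then fit in
    // 2 bits each instead of 32.
    static int[] encode(int[] column) {
        int min = Arrays.stream(column).min().orElse(0);
        int[] encoded = new int[column.length + 1];
        encoded[0] = min;
        for (int i = 0; i < column.length; i++) {
            encoded[i + 1] = column[i] - min;
        }
        return encoded;
    }

    // Decode: add the reference back onto every delta.
    static int[] decode(int[] encoded) {
        int min = encoded[0];
        int[] column = new int[encoded.length - 1];
        for (int i = 0; i < column.length; i++) {
            column[i] = encoded[i + 1] + min;
        }
        return column;
    }

    public static void main(String[] args) {
        int[] column = {100000, 100003, 100001, 100002};
        int[] encoded = encode(column);
        System.out.println(Arrays.toString(encoded)); // [100000, 0, 3, 1, 2]
        System.out.println(Arrays.equals(decode(encoded), column)); // true
    }
}
```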
I would start with a simple compression of network buffers as a blob.
At some point, Flink's internal data layout may become columnar, which should also help the blob-style compression, because more similar strings will be within one window...
Hi,
if you really want to add compression on the data path, I would encourage you to choose something as lightweight as possible. 10 GBit Ethernet is becoming pretty much commodity these days in the server space, and it is not easy to saturate such a link even without compression. Snappy is not a bad choice, but the fastest algorithm I've seen so far is LZ4: https://code.google.com/p/lz4/

Best regards,
Daniel
Hi Stephan,
Compressing network buffers as a blob is probably the fastest/easiest way to achieve some measure of results.

But I wonder if it would be possible to implement some operations on the serialized, compressed form. For example, groupings and joins should be easy to implement. If the field is not accessed in the UDF, then it won't have to be deserialized.

A complication would be how the compression scheme is passed on to the next nodes in the computation chain.

Cheers,
Viktor
Yes, working on serialized data happens in parts right now, and it would be great to extend that. While it would be possible to work on a compact serialized representation, I can't think of a way to work on a Snappy/LZ4-compressed version.
Hi Stephan,
can you give me an example (or a few) of where Flink works on serialized data?

Cheers,
Viktor
Hey!
As an example, you can have a look at the normalized-key sorter, which sorts and compares completely on serialized data.

Stephan
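The idea behind normalized keys is to write a fixed-length, byte-wise comparable prefix of each key, so sorting can run as a memcmp-style comparison on serialized bytes without deserializing records. A simplified sketch of that comparison (the names are illustrative, not Flink's actual classes, and a real sorter falls back to a full comparison when two prefixes are equal):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class NormalizedKeySort {
    static final int KEY_LEN = 8;

    // Build a fixed-length, byte-wise comparable "normalized key" for a
    // string: its first KEY_LEN bytes, zero-padded. For ASCII strings,
    // comparing these byte arrays orders records like comparing the
    // strings themselves (up to the prefix length).
    static byte[] normalizedKey(String s) {
        byte[] key = new byte[KEY_LEN];
        byte[] raw = s.getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(raw, 0, key, 0, Math.min(raw.length, KEY_LEN));
        return key;
    }

    // Unsigned byte-wise comparison (memcmp semantics).
    static int compareKeys(byte[] a, byte[] b) {
        for (int i = 0; i < KEY_LEN; i++) {
            int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (cmp != 0) return cmp;
        }
        return 0;
    }

    public static void main(String[] args) {
        String[] words = {"pear", "apple", "fig"};
        // Sort using only the serialized key bytes -- no deserialization.
        Arrays.sort(words, (x, y) -> compareKeys(normalizedKey(x), normalizedKey(y)));
        System.out.println(Arrays.toString(words)); // [apple, fig, pear]
    }
}
```

This is also why byte-level techniques compose well with a *compact* serialized layout but not with a general-purpose compressed blob, as noted above: deflate/Snappy output is not byte-wise comparable.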