Compression of network traffic

14 messages
Compression of network traffic

Viktor Rosenfeld
Hi everybody,

Does Flink do any kind of compression of the data that is sent across the network?

Best,
Viktor

Re: Compression of network traffic

Robert Metzger
Hi,

currently, we are not compressing the network data. However, it should be
easy to add support for that.
Ufuk can probably elaborate on the details, but he told me some time ago
that we can add code to compress outgoing network buffers using snappy or
other fast compression algorithms. In particular I/O intensive applications
would benefit from such a change.
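Such blob compression of an outgoing network buffer could be sketched roughly as follows. This uses `java.util.zip.Deflater` from the JDK as a self-contained stand-in for Snappy (which would need an external library); the class and method names are invented for this example and are not Flink API:

```java
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class BufferCompression {

    // Compress one outgoing network buffer as an opaque blob.
    static byte[] compress(byte[] buffer) {
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        deflater.setInput(buffer);
        deflater.finish();
        byte[] out = new byte[buffer.length + 64]; // slack for incompressible input
        int len = deflater.deflate(out);
        deflater.end();
        return Arrays.copyOf(out, len);
    }

    // The receiver would learn the original buffer size from the transfer envelope.
    static byte[] decompress(byte[] compressed, int originalLength) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] out = new byte[originalLength];
        try {
            inflater.inflate(out);
        } catch (DataFormatException e) {
            throw new RuntimeException(e);
        } finally {
            inflater.end();
        }
        return out;
    }

    public static void main(String[] args) {
        // Repetitive record data, which is where compression pays off.
        byte[] buffer = new byte[32 * 1024];
        for (int i = 0; i < buffer.length; i++) {
            buffer[i] = (byte) (i % 16);
        }
        byte[] compressed = compress(buffer);
        System.out.println(compressed.length < buffer.length);
        System.out.println(Arrays.equals(decompress(compressed, buffer.length), buffer));
    }
}
```

A real implementation would pool the codec instances and carry the uncompressed length in the buffer header.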


Robert



Re: Compression of network traffic

Ufuk Celebi-2

I agree, and I think that compressing shuffle data is generally a good idea (given fast codecs like Snappy et al.).

I've opened a corresponding issue (https://issues.apache.org/jira/browse/FLINK-1275). The network layer is going through a set of big changes at the moment. I think it makes sense to wait for them to be finished/merged, before working on this.

@Viktor: If you want to work on this, feel free to comment on the JIRA issue and we can discuss at which points you would have to extend the system.

– Ufuk

Re: Compression of network traffic

Stephan Ewen
Will the compression codec be inserted in the Netty loops, or before that?

In any case, would it make sense to prototype this on the current code and forward-port it to the new network stack later? I assume the code would mostly be similar, especially all the JNI vs. Java considerations and tricks.

Re: Compression of network traffic

Sebastian Schelter

What reduction in network traffic can we expect from using compression?

-s




Re: Compression of network traffic

Stephan Ewen
It always depends on the data; we would need to measure it. From what I have heard in the Hadoop world, anywhere between no reduction and a factor of 3-4 can happen, especially if the records contain repetitive strings...

Re: Compression of network traffic

Ufuk Celebi-2
In reply to this post by Stephan Ewen
On Mon, Nov 24, 2014 at 10:20 AM, Stephan Ewen <[hidden email]> wrote:

> Will the compression codec be inserted in the Netty loops, or before
> that?
>

In the current master, I would say that it makes sense to do it in the
Netty loops during shuffling. The compression would then be totally
transparent to the writers/readers.

With the new network stack we will have to think this through a little bit
as we have more options and scenarios due to the fact that results will
possibly be consumed multiple times.


> In any case, would it make sense to prototype this on the current code and
> forward port this to the new network stack later? I assume the code would
> mostly be similar, especially all the JNI vs. Java considerations and
> tricks.
>

Yes. It makes sense to have this in any case.

Re: Compression of network traffic

Viktor Rosenfeld
In reply to this post by Ufuk Celebi-2
Hi,

A codec like Snappy would work on an entire network buffer as one big blob, right? I was thinking more along the lines of compressing individual tuple fields by treating them as columns, e.g., using frame-of-reference encoding and bit packing. Compression on tuple fields should yield much better results than compressing the entire blob. Given that Flink controls the serialization process, this should be transparent to the other layers in the code. I am not sure it is worth the effort, though.
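To make the column idea concrete, here is a rough sketch of frame-of-reference encoding with bit packing for one integer column. The class name and the output layout (minimum, bit width, packed words) are invented for illustration, not Flink code:

```java
import java.util.Arrays;

/** Frame-of-reference encoding for one column of int values:
 *  store the column minimum once, then each value as a small packed delta. */
public class ForEncoder {

    static int bitsNeeded(int v) {
        return v == 0 ? 1 : 32 - Integer.numberOfLeadingZeros(v);
    }

    /** Returns {min, bitWidth, packedWords...}. */
    static int[] encode(int[] column) {
        int min = Arrays.stream(column).min().getAsInt();
        int max = Arrays.stream(column).max().getAsInt();
        int width = bitsNeeded(max - min);
        long totalBits = (long) column.length * width;
        int[] packed = new int[(int) ((totalBits + 31) / 32)];
        long pos = 0;
        for (int v : column) {
            long delta = v - min;
            int word = (int) (pos / 32), off = (int) (pos % 32);
            packed[word] |= (int) (delta << off);
            if (off + width > 32) {                  // delta spills into next word
                packed[word + 1] |= (int) (delta >>> (32 - off));
            }
            pos += width;
        }
        int[] out = new int[2 + packed.length];
        out[0] = min;
        out[1] = width;
        System.arraycopy(packed, 0, out, 2, packed.length);
        return out;
    }

    static int[] decode(int[] encoded, int count) {
        int min = encoded[0], width = encoded[1];
        long mask = (1L << width) - 1;
        int[] column = new int[count];
        long pos = 0;
        for (int i = 0; i < count; i++) {
            int word = (int) (pos / 32), off = (int) (pos % 32);
            long bits = (encoded[2 + word] & 0xFFFFFFFFL) >>> off;
            if (off + width > 32) {
                bits |= (encoded[3 + word] & 0xFFFFFFFFL) << (32 - off);
            }
            column[i] = (int) (bits & mask) + min;
            pos += width;
        }
        return column;
    }

    public static void main(String[] args) {
        int[] ids = {100003, 100001, 100007, 100002, 100005};
        int[] enc = encode(ids);
        // Deltas fit in 3 bits each, so 5 values pack into a single word.
        System.out.println(enc[1]);
        System.out.println(Arrays.equals(decode(enc, ids.length), ids));
    }
}
```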

Cheers,
Viktor

Re: Compression of network traffic

Stephan Ewen
I would start with a simple compression of network buffers as a blob.

At some point, Flink's internal data layout may become columnar, which
should also help the blob-style compression, because more similar strings
will be within one window...


Re: Compression of network traffic

Daniel Warneke
Hi,

if you really want to add compression on the data path, I would
encourage you to choose something as lightweight as possible. 10 GBit
Ethernet is becoming pretty much commodity in the server space these days,
and it is not easy to saturate such a link even without compression; a codec
that cannot keep up with the link can easily slow transfers down rather than
speed them up.

Snappy is not a bad choice, but the fastest algorithm I’ve seen so far
is LZ4:

https://code.google.com/p/lz4/

Best regards,

     Daniel




Re: Compression of network traffic

Viktor Rosenfeld
In reply to this post by Stephan Ewen
Hi Stephan,

Compressing network buffers as a blob is probably the fastest and easiest way to get some results.

But I wonder if it would be possible to implement some operations directly on the serialized, compressed form. For example, groupings and joins should be easy to implement: if a field is not accessed in the UDF, it never has to be deserialized.
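As a toy illustration of that idea (not Flink code; the record layout here is invented): grouping can compare the serialized key bytes directly, so the key field itself is never deserialized:

```java
import java.nio.charset.StandardCharsets;

/** Grouping on serialized records without deserializing the key field.
 *  Assumed record layout for this sketch: [keyLen:1][keyBytes][payload]. */
public class SerializedGrouping {

    static byte[] serialize(String key, byte payload) {
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        byte[] rec = new byte[1 + k.length + 1];
        rec[0] = (byte) k.length;
        System.arraycopy(k, 0, rec, 1, k.length);
        rec[rec.length - 1] = payload;
        return rec;
    }

    /** Two records belong to the same group iff their serialized key bytes match. */
    static boolean sameGroup(byte[] a, byte[] b) {
        if (a[0] != b[0]) return false;          // different key lengths
        for (int i = 1; i <= a[0]; i++) {
            if (a[i] != b[i]) return false;      // byte-wise key comparison
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] r1 = serialize("berlin", (byte) 1);
        byte[] r2 = serialize("berlin", (byte) 2);
        byte[] r3 = serialize("munich", (byte) 3);
        System.out.println(sameGroup(r1, r2));
        System.out.println(sameGroup(r1, r3));
    }
}
```

Doing the same on a Snappy/LZ4-compressed blob is the hard part, since those codecs do not preserve field boundaries.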

A complication would be how the compression scheme is passed on to the next nodes in the computation chain.

Cheers,
Viktor

Re: Compression of network traffic

Stephan Ewen
Yes, working on serialized data already happens in parts of the system today, and it would be great to extend that.

While it would be possible to work on a compact serialized representation, I can't think of a way to work directly on a Snappy/LZ4-compressed version.

Re: Compression of network traffic

Viktor Rosenfeld
Hi Stephan,

can you give me an example (or a few) of where Flink works directly on serialized data?

Cheers,
Viktor

Re: Compression of network traffic

Stephan Ewen
Hey!

As an example, you can have a look at the normalized-key sorter, which
sorts and compares records completely on serialized data.
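The idea behind normalized keys, sketched here with invented names (not the actual Flink classes): each key gets a fixed-length byte prefix such that unsigned byte-wise comparison of the prefixes orders the same way as comparing the deserialized keys, so sorting never leaves the raw bytes:

```java
/** Normalized keys: a fixed-length, byte-wise comparable prefix per key.
 *  Unsigned byte order on the prefixes matches the order of the keys. */
public class NormalizedKeys {

    /** int -> 4 bytes, big-endian, with the sign bit flipped so that
     *  unsigned byte order matches signed int order. */
    static byte[] normalize(int key) {
        int flipped = key ^ Integer.MIN_VALUE;
        return new byte[] {
            (byte) (flipped >>> 24), (byte) (flipped >>> 16),
            (byte) (flipped >>> 8),  (byte) flipped
        };
    }

    static int compare(byte[] a, byte[] b) {
        for (int i = 0; i < a.length; i++) {
            int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);  // unsigned byte comparison
            if (cmp != 0) return cmp;
        }
        return 0;
    }

    public static void main(String[] args) {
        System.out.println(compare(normalize(-5), normalize(3)) < 0);
        System.out.println(compare(normalize(42), normalize(42)) == 0);
        System.out.println(compare(normalize(7), normalize(-100)) > 0);
    }
}
```

For variable-length types like strings, a truncated prefix is used and only prefix ties fall back to the full keys.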

Stephan

