GSoC Project Proposal Draft: Code Generation in Serializers

Gábor Horváth
Hi,

I did not want to send this proposal out before I had some initial
benchmarks, but this issue was mentioned on the mailing list (
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Tuple-performance-and-the-curious-JIT-compiler-td10666.html),
and I wanted to make this information available so it can be incorporated
into that discussion. I have written this draft with the help of Gábor
Gévay and Márton Balassi, and I am open to any suggestions.


The proposal draft:
Code Generation in Serializers and Comparators of Apache Flink

I am in the last semester of my MSc studies and I am a former GSoC student
in the LLVM project. I plan to improve the serialization code in Flink over
the summer. The current implementation of the serializers can be a
performance bottleneck in some scenarios; these problems were also reported
on the mailing list recently [1]. I plan to add code generation to the
serializers to improve performance (as Stephan Ewen also suggested).

TODO: I plan to include some preliminary benchmarks in this section.
Performance problems with the current serializers

   1. PojoSerializer uses reflection for accessing the fields, which is slow
      (e.g. [2]). This is also a serious problem for the comparators.

   2. When deserializing fields of primitive types (e.g. int), the reusing
      overload of the corresponding field serializers cannot really do any
      reuse, because boxed primitive types are immutable in Java. This
      results in lots of object creations. [3][7]

   3. The loop that calls the field serializers makes virtual function calls
      that cannot be speculatively devirtualized by the JVM or predicted by
      the CPU, because different serializer subclasses are invoked for the
      different fields. (And the loop cannot be unrolled, because the number
      of iterations is not a compile-time constant.) See also the discussion
      on the mailing list [1].

   4. A POJO field can have the value null, so the serializer inserts 1-byte
      null tags, which wastes space. (Also, the type extractor logic does not
      distinguish between primitive types and their boxed versions, so even
      an int field has a null tag.)

   5. Subclass tags also add a byte at the beginning of every POJO.

   6. getLength() does not know the size in most cases [4]. Knowing the size
      of a type when serialized has numerous performance benefits throughout
      Flink:

      1. Sorters can work in-place when the type is small [5].

      2. Chaining hash tables do not need resizes, because they know how
         many buckets to allocate upfront [6].

      3. Different hash table architectures could be used, e.g. open
         addressing with linear probing instead of some form of chaining.

      4. It is possible to deserialize, modify, and then serialize back a
         record to its original place, because it cannot happen that the
         modified version does not fit in the space allocated for the old
         version (see CompactingHashTable and ReduceHashTable for concrete
         instances of this problem).

Note that 2 and 3 are problems not just with the PojoSerializer, but also
with the TupleSerializer. (A simplified sketch of the current serialization
loop follows below.)
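
To make problems 1-4 more concrete, here is a simplified, hypothetical sketch
of such a generic per-field serialization loop. The FieldSerializer and Output
types and the overall structure are illustrative stand-ins, not the actual
PojoSerializer code:

import java.lang.reflect.Field;

// Simplified, hypothetical sketch of a generic reflection-based POJO
// serialization loop; the types below are illustrative stand-ins, not the
// actual Flink classes.
class GenericPojoSerializerSketch<T> {

    interface FieldSerializer { void serialize(Object value, Output out); }
    interface Output { void writeBoolean(boolean b); }

    private final Field[] fields;                      // obtained via reflection
    private final FieldSerializer[] fieldSerializers;  // a different subclass per field type

    GenericPojoSerializerSketch(Field[] fields, FieldSerializer[] fieldSerializers) {
        this.fields = fields;
        this.fieldSerializers = fieldSerializers;
    }

    void serialize(T record, Output out) throws IllegalAccessException {
        for (int i = 0; i < fields.length; i++) {
            Object value = fields[i].get(record);  // problem 1: reflective field access
            out.writeBoolean(value == null);       // problem 4: a null tag for every field
            if (value != null) {
                // problem 3: a virtual call whose receiver type changes from field
                // to field, so the JVM cannot devirtualize it; a primitive field
                // also arrives here boxed, which defeats reuse (problem 2).
                fieldSerializers[i].serialize(value, out);
            }
        }
    }
}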
Solution approaches

   1. Run-time code generation for every POJO (a sketch of this approach
      follows after this list)

      - Problems 1 and 3 would be automatically solved if the serializers
        for POJOs were generated on the fly (by, for example, Javassist).

      - Problem 2 also needs code generation, plus some extra effort in the
        type extractor to distinguish between primitive types and their
        boxed versions.

      - The same technique could be used for the PojoComparator as well,
        which could greatly increase the performance of sorting.

   2. Annotations on POJOs (by the users); a sketch of possible annotations
      also follows below

      - Concretely:

         - annotate fields that will never be null -> no null tag needed
           before every field

         - make a POJO final -> no subclass tag needed

         - annotate a POJO that it will never be null -> no top-level null
           tag needed

      - These would also help with the getLength problem (6), because the
        length is currently often unknown only because anything can be null
        or a subclass can appear anywhere.

      - These annotations could be honored without code generation, but then
        they would add some overhead when no annotations are present, so
        this would work better together with the code generation.

      - Tuples would become a special case of POJOs where nothing can be
        null and no subclass can appear, so maybe we could eliminate the
        TupleSerializer.

      - We could annotate some internal types in the Flink libraries (Gelly
        (Vertex, Edge), FlinkML).
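
As a rough illustration of approach 1, here is a hedged sketch of generating
a specialized serializer at run time with Javassist. MyPojo, the
SpecializedSerializer interface, and the generated class name are made up
for illustration; this is only the general shape of the idea, not the
proposed Flink implementation:

import javassist.ClassPool;
import javassist.CtClass;
import javassist.CtNewMethod;

// Hypothetical POJO with primitive fields of statically known size.
class MyPojo {
    public int id;
    public double score;
}

// Hypothetical interface that the generated serializers would implement.
interface SpecializedSerializer {
    void serialize(Object record, java.io.DataOutput out) throws java.io.IOException;
}

// Hedged sketch of run-time code generation with Javassist: the generated
// serialize() body is straight-line code over the known fields, so there is
// no reflection, no per-field virtual dispatch, and no null tags for
// primitive fields. With only fixed-size fields, getLength() could also
// report a constant size (here 4 + 8 = 12 bytes).
public class CodeGenSketch {

    static SpecializedSerializer generateForMyPojo() throws Exception {
        ClassPool pool = ClassPool.getDefault();
        CtClass generated = pool.makeClass("GeneratedMyPojoSerializer");
        generated.addInterface(pool.get("SpecializedSerializer"));

        String method =
            "public void serialize(Object record, java.io.DataOutput out) throws java.io.IOException {"
            + "  MyPojo p = (MyPojo) record;"
            + "  out.writeInt(p.id);"        // fixed 4 bytes
            + "  out.writeDouble(p.score);"  // fixed 8 bytes
            + "}";
        generated.addMethod(CtNewMethod.make(method, generated));
        return (SpecializedSerializer) generated.toClass().newInstance();
    }
}

A generated PojoComparator could follow the same pattern, comparing the key
fields directly instead of looping over per-field comparator objects.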


TODO: What is the situation with Scala case classes? Run-time code
generation is probably easier in Scala (e.g. with quasiquotes).
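
To make approach 2 more concrete as well, here is a hedged sketch of what
the user-facing annotations could look like. The annotation names (NeverNull,
NonNullType) are invented for illustration and do not exist in Flink today:

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical annotation: the field never holds null, so no per-field null
// tag is needed in the serialized form.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface NeverNull {}

// Hypothetical annotation: values of this type are never null, so no
// top-level null tag is needed.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface NonNullType {}

// Example POJO using the hypothetical annotations. Being final means no
// subclass tag is needed; with the annotations and only fixed-size fields,
// the serialized length becomes a constant that getLength() could report.
@NonNullType
final class AnnotatedPoint {
    @NeverNull public Integer id;  // serialized without a null tag
    public int x;                  // primitive, never null anyway
    public int y;
}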

About me

I am in the last year of my Computer Science MSc studies at Eotvos Lorand
University in Budapest, and I am planning to start a PhD in the autumn. I
have been working for almost three years at Ericsson on static analysis
tools for C++. In 2014 I participated in GSoC, working on the LLVM project,
and I have been a frequent contributor ever since. The following summer I
interned at Apple.

I learned about the Flink project not too long ago and I like it so far.
Over the last few weeks I have been working on some tickets to familiarize
myself with the codebase:

https://issues.apache.org/jira/browse/FLINK-3422

https://issues.apache.org/jira/browse/FLINK-3322

https://issues.apache.org/jira/browse/FLINK-3457

My CV is available here: http://xazax.web.elte.hu/files/resume.pdf
References

[1]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Tuple-performance-and-the-curious-JIT-compiler-td10666.html

[2]
https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/typeutils/runtime/PojoSerializer.java#L369

[3]
https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/api/common/typeutils/base/IntSerializer.java#L73

[4]
https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/api/common/typeutils/TypeSerializer.java#L98

[5]
https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/operators/sort/FixedLengthRecordSorter.java

[6]
https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/operators/hash/CompactingHashTable.java#L861

[7]
https://issues.apache.org/jira/browse/FLINK-3277


Best Regards,

Gábor

Re: GSoC Project Proposal Draft: Code Generation in Serializers

Gábor Horváth
Hi!

As far as I can see, the formatting of my previous mail got mangled. A
better formatted version is available here:
https://docs.google.com/document/d/1VC8lCeErx9kI5lCMPiUn625PO0rxR-iKlVqtt3hkVnk
Sorry about that.

Regards,
Gábor


Re: GSoC Project Proposal Draft: Code Generation in Serializers

Gábor Horváth
Hi,

I have updated the draft to include preliminary benchmarks, mention the
interaction of annotations with savepoints, and add a timeline and some
notes about Scala case classes.

Regards,
Gábor


Re: GSoC Project Proposal Draft: Code Generation in Serializers

Márton Balassi
Thanks Gábor, now I also see it on the internal GSoC interface. I have
indicated that I wish to mentor your project; I think you can now hit
finalize on it there.


Re: GSoC Project Proposal Draft: Code Generation in Serializers

Gábor Horváth
Thank you! I finalized the project.


Re: GSoC Project Proposal Draft: Code Generation in Serializers

Gábor Horváth
Hi!

The Table API already uses code generation and the Janino compiler [1]. Is
that a dependency that is OK to add to flink-core? If so, I think I will use
the same compiler in order to be consistent with the other code generation
efforts.

I have started to look at the Table API code generation [2], and it uses
Scala extensively. Several Scala features, such as pattern matching and
string interpolation, can make generating Java code easier. However, I have
not seen any Scala code in flink-core yet. Is it OK to implement the code
generation inside flink-core using Scala?
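
For reference, a minimal hedged sketch of the kind of in-memory compilation
Janino provides; the generated class and its contents are made up for
illustration:

import org.codehaus.janino.SimpleCompiler;

// Minimal, hedged sketch of compiling a generated source string in memory
// with Janino; the generated class and its contents are illustrative only.
public class JaninoSketch {
    public static void main(String[] args) throws Exception {
        String source =
            "public class GeneratedSerializer {\n"
            + "    public int serializedLength() { return 12; }\n"  // e.g. an int plus a double
            + "}\n";

        SimpleCompiler compiler = new SimpleCompiler();
        compiler.cook(source);  // compile the source string in memory
        Class<?> cls = compiler.getClassLoader().loadClass("GeneratedSerializer");
        Object instance = cls.newInstance();
        System.out.println(cls.getMethod("serializedLength").invoke(instance));  // prints 12
    }
}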

Regards,
Gábor

[1] http://unkrig.de/w/Janino
[2]
https://github.com/apache/flink/blob/master/flink-libraries/flink-table/src/main/scala/org/apache/flink/api/table/codegen/CodeGenerator.scala


Re: GSoC Project Proposal Draft: Code Generation in Serializers

Márton Balassi
Hi Gábor,

I think adding the Janino dependency to flink-core should be fine, as it has
quite slim dependencies [1,2] that are largely orthogonal to Flink's main
dependency line (and it is already used elsewhere).

As for mixing in Scala code that is used from the Java parts of the same
Maven module, I am skeptical. We have seen IDE compilation issues with
projects using this setup, and have decided that the potential
community-wide IDE setup pain outweighs the individual implementation
convenience of Scala.

[1]
https://repo1.maven.org/maven2/org/codehaus/janino/janino-parent/2.7.8/janino-parent-2.7.8.pom
[2]
https://repo1.maven.org/maven2/org/codehaus/janino/janino/2.7.8/janino-2.7.8.pom

On Sat, Apr 16, 2016 at 5:51 PM, Gábor Horváth <[hidden email]> wrote:

> Hi!
>
> Table API already uses code generation and the Janino compiler [1]. Is it a
> dependency that is ok to add to flink-core? In case it is ok, I think I
> will use the same in order to be consistent with the other code generation
> efforts.
>
> I started to look at the Table API code generation [2] and it uses Scala
> extensively. There are several Scala features that can make Java code
> generation easier such as pattern matching and string interpolation. I did
> not see any Scala code in flink-core yet. Is it ok to implement the code
> generation inside the flink-core using Scala?
>
> Regards,
> Gábor
>
> [1] http://unkrig.de/w/Janino
> [2]
>
> https://github.com/apache/flink/blob/master/flink-libraries/flink-table/src/main/scala/org/apache/flink/api/table/codegen/CodeGenerator.scala
>
> On 18 March 2016 at 19:37, Gábor Horváth <[hidden email]> wrote:
>
> > Thank you! I finalized the project.
> >
> >
> > On 18 March 2016 at 10:29, Márton Balassi <[hidden email]>
> > wrote:
> >
> >> Thanks Gábor, now I also see it on the internal GSoC interface. I have
> >> indicated that I wish to mentor your project, I think you can hit
> finalize
> >> on your project there.
> >>
> >> On Mon, Mar 14, 2016 at 11:16 AM, Gábor Horváth <[hidden email]>
> >> wrote:
> >>
> >> > Hi,
> >> >
> >> > I have updated this draft to include preliminary benchmarks, mentioned
> >> the
> >> > interaction of annotations with savepoints, extended it with a
> timeline,
> >> > and some notes about scala case classes.
> >> >
> >> > Regards,
> >> > Gábor
> >> >
> >> > On 9 March 2016 at 16:12, Gábor Horváth <[hidden email]> wrote:
> >> >
> >> > > Hi!
> >> > >
> >> > > As far as I can see the formatting was not correct in my previous
> >> mail. A
> >> > > better formatted version is available here:
> >> > >
> >> >
> >>
> https://docs.google.com/document/d/1VC8lCeErx9kI5lCMPiUn625PO0rxR-iKlVqtt3hkVnk
> >> > > Sorry for that.
> >> > >
> >> > > Regards,
> >> > > Gábor
> >> > >

Re: GSoC Project Proposal Draft: Code Generation in Serializers

Chiwan Park-2
I prefer to avoid Scala dependencies in flink-core. If flink-core included Scala dependencies, a Scala version suffix (_2.10 or _2.11) would have to be added to the artifact name. I think that users could be confused by this.

Regards,
Chiwan Park

> On Apr 17, 2016, at 3:49 PM, Márton Balassi <[hidden email]> wrote:
>
> Hi Gábor,
>
> I think that adding the Janino dep to flink-core should be fine, as it has
> quite slim dependencies [1,2] which are generally orthogonal to Flink's
> main dependency line (also it is already used elsewhere).
>
> As for mixing Scala code that is used from the Java parts of the same maven
> module I am skeptical. We have seen IDE compilation issues with projects
> using this setup and have decided that the community-wide potential IDE
> setup pain outweighs the individual implementation convenience with Scala.
>
> [1]
> https://repo1.maven.org/maven2/org/codehaus/janino/janino-parent/2.7.8/janino-parent-2.7.8.pom
> [2]
> https://repo1.maven.org/maven2/org/codehaus/janino/janino/2.7.8/janino-2.7.8.pom
>
> On Sat, Apr 16, 2016 at 5:51 PM, Gábor Horváth <[hidden email]> wrote:
>
>> Hi!
>>
>> Table API already uses code generation and the Janino compiler [1]. Is it a
>> dependency that is ok to add to flink-core? In case it is ok, I think I
>> will use the same in order to be consistent with the other code generation
>> efforts.
>>
>> I started to look at the Table API code generation [2] and it uses Scala
>> extensively. There are several Scala features that can make Java code
>> generation easier such as pattern matching and string interpolation. I did
>> not see any Scala code in flink-core yet. Is it ok to implement the code
>> generation inside the flink-core using Scala?
>>
>> Regards,
>> Gábor
>>
>> [1] http://unkrig.de/w/Janino
>> [2]
>>
>> https://github.com/apache/flink/blob/master/flink-libraries/flink-table/src/main/scala/org/apache/flink/api/table/codegen/CodeGenerator.scala
>>


Re: GSoC Project Proposal Draft: Code Generation in Serializers

Márton Balassi
Chiwan, just to clarify: Janino is a Java project. [1]

[1] https://github.com/aunkrig/janino
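
For readers unfamiliar with Janino: the approach discussed in this thread amounts to emitting Java source for a concrete type at runtime, compiling it in memory, and loading the resulting class. Below is a minimal sketch of that pattern; the generated class and its serialize method are purely illustrative and are not what Flink's generated serializers would actually look like.

import org.codehaus.janino.SimpleCompiler;

public class JaninoCodeGenSketch {
    public static void main(String[] args) throws Exception {
        // Source text that a code generator might emit for a POJO with two int fields.
        // A real generated serializer would write into Flink's DataOutputView instead.
        String source =
            "public class GeneratedPointSerializer {\n" +
            "    public byte[] serialize(int x, int y) {\n" +
            "        return new byte[] {\n" +
            "            (byte) (x >>> 24), (byte) (x >>> 16), (byte) (x >>> 8), (byte) x,\n" +
            "            (byte) (y >>> 24), (byte) (y >>> 16), (byte) (y >>> 8), (byte) y\n" +
            "        };\n" +
            "    }\n" +
            "}\n";

        // Janino compiles the source in memory; the generated class is then
        // loaded from the compiler's class loader. Reflection is used here only
        // to keep the sketch self-contained; a real integration would bind the
        // generated class to a known serializer interface instead.
        SimpleCompiler compiler = new SimpleCompiler();
        compiler.cook(source);
        Class<?> clazz = compiler.getClassLoader().loadClass("GeneratedPointSerializer");
        Object serializer = clazz.getDeclaredConstructor().newInstance();
        byte[] bytes = (byte[]) clazz
            .getMethod("serialize", int.class, int.class)
            .invoke(serializer, 1, 2);
        System.out.println("serialized length: " + bytes.length); // prints 8
    }
}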


Re: GSoC Project Proposal Draft: Code Generation in Serializers

Chiwan Park-2
Yes, I know Janino is a pure Java project. I meant that if we add Scala code to flink-core, we would have to add a Scala dependency to flink-core, and that could be confusing.

Regards,
Chiwan Park



Re: GSoC Project Proposal Draft: Code Generation in Serializers

Fabian Hueske-2
+1 for not mixing Java and Scala in flink-core.

Maybe it makes sense to implement the code-generated serializers /
comparators as a separate module that can be plugged in. This could be
pure Scala.
In general, I think it would be good to have some kind of "version
management" for serializers in place: with features such as savepoints that
depend on the implementation of the serializers, we need a mechanism to
switch between implementations.

Best, Fabian
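
To make the pluggable-module idea above concrete, here is a small, purely hypothetical sketch of a registry that can switch between serializer backends, for example the existing reflection-based one and a code-generated one. None of these types exist in Flink; they only illustrate the shape of such a "version management" mechanism.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical abstraction over a serializer implementation ("backend"). */
interface SerializerBackend {
    <T> byte[] serialize(T value);
    String formatVersion(); // identifier that could be recorded in a savepoint
}

/** Registry that lets the runtime pick a backend by its format version,
 *  e.g. the reflection-based implementation vs. a code-generated one. */
final class SerializerBackendRegistry {
    private static final Map<String, SerializerBackend> BACKENDS = new ConcurrentHashMap<>();

    static void register(SerializerBackend backend) {
        BACKENDS.put(backend.formatVersion(), backend);
    }

    /** When restoring a savepoint, the recorded format version selects the
     *  backend that can read it, independently of the current default. */
    static SerializerBackend forFormatVersion(String version) {
        SerializerBackend backend = BACKENDS.get(version);
        if (backend == null) {
            throw new IllegalArgumentException("No serializer backend for format: " + version);
        }
        return backend;
    }
}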


Re: GSoC Project Proposal Draft: Code Generation in Serializers

Gábor Horváth
Unfortunately, making code generation a separate module would introduce a
cyclic dependency: the code generation needs the TypeInformation classes,
which live in flink-core, while flink-core in turn needs the generated
serializers from the code generation module. Do you have a solution for
this?
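
To make the cycle concrete, here is a rough sketch with made-up class names.
Both halves are collapsed into a single file only so that the example
compiles; in the proposed layout they would live in two separate Maven
modules, which is exactly what turns this into a cycle:

    // Illustrative sketch only: the class and method names are made up.
    // Both halves sit in one file here so that it compiles, but in the
    // proposed layout they would live in two different Maven modules.
    class TypeInfoSide {
        // Stand-in for flink-core's TypeInformation: to hand out a generated
        // serializer it would have to call into the code generation module.
        Object createSerializer() {
            return CodegenSide.generateSerializerFor(TypeInfoSide.class);
        }
    }

    class CodegenSide {
        // Stand-in for the code generation module: it needs the type
        // information from flink-core to know which fields to generate
        // code for, so it depends on flink-core in turn.
        static Object generateSerializerFor(Class<?> type) {
            return "generated serializer for " + type.getSimpleName();
        }
    }

    public class CycleDemo {
        public static void main(String[] args) {
            System.out.println(new TypeInfoSide().createSerializer());
        }
    }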

If we can come up with a solution, I will implement it as a separate Scala
module; otherwise I will stick to Java.

BR,
Gábor


Re: GSoC Project Proposal Draft: Code Generation in Serializers

Fabian Hueske-2
Hi Gabor,

you are right: a codegen serializer module would depend on flink-core, and
in the current design flink-core would also need to know about the type
infos / serializers / comparators.

Decoupling the implementations of type infos, serializers, and comparators
from flink-core and resolving the cyclic dependency is exactly what the
plugin architecture would be for. Maybe this can be done with some mechanism
to dynamically load TypeInformations for types with overridden serializers /
comparators. This would require a design document and a discussion in the
community.
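
Just to illustrate the direction I have in mind, a minimal sketch based on
java.util.ServiceLoader; the interface below is hypothetical and not an
existing Flink API:

    import java.util.ServiceLoader;

    // Hypothetical extension point; the name and shape are illustrative only.
    interface SerializerFactoryProvider {
        boolean supports(Class<?> type);
        Object createSerializerFor(Class<?> type); // would return a TypeSerializer in Flink
    }

    public class PluggableSerializers {
        public static Object serializerFor(Class<?> type, Object reflectionBasedFallback) {
            // Other modules register implementations under
            // META-INF/services/SerializerFactoryProvider, so flink-core only
            // needs to know the interface, not the code generation module.
            for (SerializerFactoryProvider provider
                    : ServiceLoader.load(SerializerFactoryProvider.class)) {
                if (provider.supports(type)) {
                    return provider.createSerializerFor(type);
                }
            }
            // No provider on the classpath: keep the current serializer.
            return reflectionBasedFallback;
        }
    }

A provider shipped with the code generation module would then be picked up
purely at runtime, which would break the compile-time cycle you described.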

Cheers, Fabian

Re: GSoC Project Proposal Draft: Code Generation in Serializers

Gábor Horváth
Hi Fabian,

I agree that it would be awesome to move this to its own module/plugin.
However, in order to write the code generation in Scala, I would need to
rewrite the type information in Scala as well. I will not have time to do
that during the summer, so I will stick to Java; the modularization can be
done later.
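
For what it is worth, the runtime compilation round trip is quite compact in
plain Java with Janino. A rough sketch, where the generated source string is
only a toy stand-in for a real generated serializer:

    import org.codehaus.janino.SimpleCompiler;

    public class JaninoSketch {
        public static void main(String[] args) throws Exception {
            // In the real project this string would be produced from the
            // TypeInformation of a POJO; here it is just a toy class.
            String source =
                "public class GeneratedIntWriter {\n"
                + "    public byte[] write(int value) {\n"
                + "        return new byte[] {\n"
                + "            (byte) (value >>> 24), (byte) (value >>> 16),\n"
                + "            (byte) (value >>> 8), (byte) value };\n"
                + "    }\n"
                + "}\n";

            SimpleCompiler compiler = new SimpleCompiler();
            compiler.cook(source); // compile the generated source in memory
            Class<?> clazz = compiler.getClassLoader().loadClass("GeneratedIntWriter");
            Object writer = clazz.getDeclaredConstructor().newInstance();
            byte[] bytes = (byte[]) clazz.getMethod("write", int.class).invoke(writer, 42);
            System.out.println(bytes.length); // 4
        }
    }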

Thanks,
Gábor


Re: GSoC Project Proposal Draft: Code Generation in Serializers

Fabian Hueske-2
Why would you need to rewrite the TypeInformation in Scala?
I think we need a way to replace Serializer implementations anyway, unless
the generated serializers are compatible with the current ones.


Re: GSoC Project Proposal Draft: Code Generation in Serializers

Gábor Horváth
On second thought, I think you are right. I had the impression that there
is a cyclic dependency between TypeInformation and the serializers, but
that is not the case. So no rewrite of TypeInformation is needed in order
to be able to use Scala for the serializers.

According to the proposal, unless someone uses the annotations, the
generated serializers would be compatible with the current ones. There
could be a configuration option for whether to make the layout more
compact based on the annotations.
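
As a rough, self-contained Java sketch of what that configuration option
could mean for the generated code: the @NeverNull annotation, the Person
POJO, the compactLayout flag and the serialize() method below are all
hypothetical illustrations (not existing Flink APIs), standing in for the
kind of code a generated serializer might emit.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical user-facing annotation: promises that a field is never null.
// RUNTIME retention so a code generator could inspect it via reflection.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface NeverNull {}

// Example POJO: one field carries the promise, the other does not.
class Person {
    @NeverNull String name;
    String nickname; // may be null
    Person(String name, String nickname) { this.name = name; this.nickname = nickname; }
}

public class CompactLayoutSketch {

    // Stands in for the body a generated Person serializer could emit.
    // 'compactLayout' models the proposed configuration switch.
    static void serialize(Person p, DataOutputStream out, boolean compactLayout) throws IOException {
        if (compactLayout) {
            // @NeverNull field: the 1-byte null tag can be omitted entirely.
            out.writeUTF(p.name);
        } else {
            // Compatible layout: keep the 1-byte null tag before the field.
            out.writeBoolean(p.name == null);
            if (p.name != null) {
                out.writeUTF(p.name);
            }
        }
        // Unannotated field: the null tag stays in both layouts.
        out.writeBoolean(p.nickname == null);
        if (p.nickname != null) {
            out.writeUTF(p.nickname);
        }
    }

    public static void main(String[] args) throws IOException {
        Person p = new Person("Ada", null);
        ByteArrayOutputStream compatible = new ByteArrayOutputStream();
        ByteArrayOutputStream compact = new ByteArrayOutputStream();
        serialize(p, new DataOutputStream(compatible), false);
        serialize(p, new DataOutputStream(compact), true);
        System.out.println("compatible layout: " + compatible.size() + " bytes");
        System.out.println("compact layout:    " + compact.size() + " bytes");
    }
}

With the flag off, the per-field null tags are kept (matching the layout the
proposal describes for the current PojoSerializer); with it on, the
@NeverNull field drops its tag, so the record above is one byte shorter.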


Re: GSoC Project Proposal Draft: Code Generation in Serializers

Gábor Horváth
Hi,

The GSoC project proposal was accepted! Thank you for all your support. I
will do my best to live up to the challenges and deliver everything that
was planned for this summer.

Best Regards,
Gábor

On 20 April 2016 at 16:18, Gábor Horváth <[hidden email]> wrote:

> On the second thought I think you are right. I had the impression that
> there is cyclic dependency between TypeInformation and the serializers but
> that is not the case. So there is no rewrite needed for TypeInformation in
> order to be able to use Scala for serializers.
>
> According to the proposal unless someone utilize the annotations the
> generated serializers would be compatible to the current ones. There could
> be a configuration option whether to try to make the layout more compact
> based on annotations.
>
> On 20 April 2016 at 16:03, Fabian Hueske <[hidden email]> wrote:
>
>> Why would you need to rewrite the TypeInformation in Scala?
>> I think we need a way to replace Serializer implementations anyway unless
>> the generated serializers are compatible to the current ones.
>>
>> 2016-04-20 15:53 GMT+02:00 Gábor Horváth <[hidden email]>:
>>
>> > Hi Fabian,
>> >
>> > I agree that it would be awesome to move this to its own module/plugin.
>> > However in order to be able to write the code generation in Scala I
>> would
>> > need to rewrite the type information to use Scala as well. I think I
>> will
>> > not
>> > have time to do this during the summer, so I think I will stick to Java
>> and
>> > this modularization can be done later.
>> >
>> > Thanks,
>> > Gábor
>> >
>> > On 19 April 2016 at 11:50, Fabian Hueske <[hidden email]> wrote:
>> >
>> > > Hi Gabor,
>> > >
>> > > you are right, a codegen serializer module would depend on flink-core
>> and
>> > > in the current design flink-core would need to know about the type
>> infos
>> > /
>> > > serializers / comparators.
>> > >
>> > > Decoupling implementations of type info, serializers, and comparators
>> > from
>> > > flink-core and resolving the cyclic dependency would be what the
>> plugin
>> > > architecture would be for.
>> > > Maybe this can be done by some mechanism to dynamically load
>> > > TypeInformations for types with overridden serializers / comparators.
>> > > This would require some design document and discussion in the
>> community.
>> > >
>> > > Cheers, Fabian
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > 2016-04-18 21:19 GMT+02:00 Gábor Horváth <[hidden email]>:
>> > >
>> > > > Unfortunately making code generation a separate module would
>> introduce
>> > > > cyclic dependency.
>> > > > Code generation requires the TypeInfo which is available in
>> flink-core
>> > > and
>> > > > flink-core requires
>> > > > the generated serializers from the code generation module. Do you
>> have
>> > a
>> > > > solution for this?
>> > > >
>> > > > I think if we can come up with a solution I will implement it as a
>> > > separate
>> > > > Scala module
>> > > > otherwise I will stick to Java.
>> > > >
>> > > > BR,
>> > > > Gábor
>> > > >
>> > > > On 18 April 2016 at 12:40, Fabian Hueske <[hidden email]> wrote:
>> > > >
>> > > > > +1 for not mixing Java and Scala in flink-core.
>> > > > >
>> > > > > Maybe it makes sense to implement the code generated serializers /
>> > > > > comparators as a separate module which can be plugged-in. This
>> could
>> > be
>> > > > > pure Scala.
>> > > > > In general, I think it would be good to have some kind of "version
>> > > > > management" for serializers in place. With features such as
>> > safepoints
>> > > > that
>> > > > > depend on the implementation of serializers, it would be good to
>> > have a
>> > > > > mechanism to switch between implementations.
>> > > > >
>> > > > > Best, Fabian
>> > > > >
>> > > > > 2016-04-18 10:01 GMT+02:00 Chiwan Park <[hidden email]>:
>> > > > >
>> > > > > > Yes, I know Janino is a pure Java project. I meant if we add
>> Scala
>> > > code
>> > > > > to
>> > > > > > flink-core, we should add Scala dependency to flink-core and it
>> > could
>> > > > be
>> > > > > > confusing.
>> > > > > >
>> > > > > > Regards,
>> > > > > > Chiwan Park
>> > > > > >
>> > > > > > > On Apr 18, 2016, at 2:49 PM, Márton Balassi <
>> > > > [hidden email]>
>> > > > > > wrote:
>> > > > > > >
>> > > > > > > Chiwan, just to clarify Janino is a Java project. [1]
>> > > > > > >
>> > > > > > > [1] https://github.com/aunkrig/janino
>> > > > > > >
>> > > > > > > On Mon, Apr 18, 2016 at 3:40 AM, Chiwan Park <
>> > > [hidden email]>
>> > > > > > wrote:
>> > > > > > >
>> > > > > > >> I prefer to avoid Scala dependencies in flink-core. If
>> > flink-core
>> > > > > > includes
>> > > > > > >> Scala dependencies, Scala version suffix (_2.10 or _2.11)
>> should
>> > > be
>> > > > > > added.
>> > > > > > >> I think that users could be confused.
>> > > > > > >>
>> > > > > > >> Regards,
>> > > > > > >> Chiwan Park
>> > > > > > >>
>> > > > > > >>> On Apr 17, 2016, at 3:49 PM, Márton Balassi <
>> > > > > [hidden email]>
>> > > > > > >> wrote:
>> > > > > > >>>
>> > > > > > >>> Hi Gábor,
>> > > > > > >>>
>> > > > > > >>> I think that adding the Janino dep to flink-core should be
>> > fine,
>> > > as
>> > > > > it
>> > > > > > >> has
>> > > > > > >>> quite slim dependencies [1,2] which are generally
>> orthogonal to
>> > > > > Flink's
>> > > > > > >>> main dependency line (also it is already used elsewhere).
>> > > > > > >>>
>> > > > > > >>> As for mixing Scala code that is used from the Java parts of
>> > the
>> > > > same
>> > > > > > >> maven
>> > > > > > >>> module I am skeptical. We have seen IDE compilation issues
>> with
>> > > > > > projects
>> > > > > > >>> using this setup and have decided that the community-wide
>> > > potential
>> > > > > IDE
>> > > > > > >>> setup pain outweighs the individual implementation
>> convenience
>> > > with
>> > > > > > >> Scala.
>> > > > > > >>>
>> > > > > > >>> [1]
>> > > > > > >>>
>> > > > > > >>
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://repo1.maven.org/maven2/org/codehaus/janino/janino-parent/2.7.8/janino-parent-2.7.8.pom
>> > > > > > >>> [2]
>> > > > > > >>>
>> > > > > > >>
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://repo1.maven.org/maven2/org/codehaus/janino/janino/2.7.8/janino-2.7.8.pom
>> > > > > > >>>
>> > > > > > >>> On Sat, Apr 16, 2016 at 5:51 PM, Gábor Horváth <
>> > > > [hidden email]>
>> > > > > > >> wrote:
>> > > > > > >>>
>> > > > > > >>>> Hi!
>> > > > > > >>>>
>> > > > > > >>>> Table API already uses code generation and the Janino
>> compiler
>> > > > [1].
>> > > > > Is
>> > > > > > >> it a
>> > > > > > >>>> dependency that is ok to add to flink-core? In case it is
>> ok,
>> > I
>> > > > > think
>> > > > > > I
>> > > > > > >>>> will use the same in order to be consistent with the other
>> > code
>> > > > > > >> generation
>> > > > > > >>>> efforts.
>> > > > > > >>>>
>> > > > > > >>>> I started to look at the Table API code generation [2] and
>> it
>> > > uses
>> > > > > > Scala
>> > > > > > >>>> extensively. There are several Scala features that can make
>> > Java
>> > > > > code
>> > > > > > >>>> generation easier such as pattern matching and string
>> > > > > interpolation. I
>> > > > > > >> did
>> > > > > > >>>> not see any Scala code in flink-core yet. Is it ok to
>> > implement
>> > > > the
>> > > > > > code
>> > > > > > >>>> generation inside the flink-core using Scala?
>> > > > > > >>>>
>> > > > > > >>>> Regards,
>> > > > > > >>>> Gábor
>> > > > > > >>>>
>> > > > > > >>>> [1] http://unkrig.de/w/Janino
>> > > > > > >>>> [2]
>> > > > > > >>>>
>> > > > > > >>>>
>> > > > > > >>
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://github.com/apache/flink/blob/master/flink-libraries/flink-table/src/main/scala/org/apache/flink/api/table/codegen/CodeGenerator.scala
>> > > > > > >>>>
>> > > > > > >>>> On 18 March 2016 at 19:37, Gábor Horváth <
>> [hidden email]
>> > >
>> > > > > wrote:
>> > > > > > >>>>
>> > > > > > >>>>> Thank you! I finalized the project.
>> > > > > > >>>>>
>> > > > > > >>>>>
>> > > > > > >>>>> On 18 March 2016 at 10:29, Márton Balassi <
>> > > > > [hidden email]>
>> > > > > > >>>>> wrote:
>> > > > > > >>>>>
>> > > > > > >>>>>> Thanks Gábor, now I also see it on the internal GSoC
>> > > interface.
>> > > > I
>> > > > > > have
>> > > > > > >>>>>> indicated that I wish to mentor your project, I think you
>> > can
>> > > > hit
>> > > > > > >>>> finalize
>> > > > > > >>>>>> on your project there.
>> > > > > > >>>>>>
>> > > > > > >>>>>> On Mon, Mar 14, 2016 at 11:16 AM, Gábor Horváth <
>> > > > > > [hidden email]>
>> > > > > > >>>>>> wrote:
>> > > > > > >>>>>>
>> > > > > > >>>>>>> Hi,
>> > > > > > >>>>>>>
>> > > > > > >>>>>>> I have updated this draft to include preliminary
>> > benchmarks,
>> > > > > > >> mentioned
>> > > > > > >>>>>> the
>> > > > > > >>>>>>> interaction of annotations with savepoints, extended it
>> > with
>> > > a
>> > > > > > >>>> timeline,
>> > > > > > >>>>>>> and some notes about scala case classes.
>> > > > > > >>>>>>>
>> > > > > > >>>>>>> Regards,
>> > > > > > >>>>>>> Gábor
>> > > > > > >>>>>>>
>> > > > > > >>>>>>> On 9 March 2016 at 16:12, Gábor Horváth <
>> > [hidden email]
>> > > >
>> > > > > > wrote:
>> > > > > > >>>>>>>
>> > > > > > >>>>>>>> Hi!
>> > > > > > >>>>>>>>
>> > > > > > >>>>>>>> As far as I can see the formatting was not correct in
>> my
>> > > > > previous
>> > > > > > >>>>>> mail. A
>> > > > > > >>>>>>>> better formatted version is available here:
>> > > > > > >>>>>>>>
>> > > > > > >>>>>>>
>> > > > > > >>>>>>
>> > > > > > >>>>
>> > > > > > >>
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://docs.google.com/document/d/1VC8lCeErx9kI5lCMPiUn625PO0rxR-iKlVqtt3hkVnk
>> > > > > > >>>>>>>> Sorry for that.
>> > > > > > >>>>>>>>
>> > > > > > >>>>>>>> Regards,
>> > > > > > >>>>>>>> Gábor
>> > > > > > >>>>>>>>
>> > > > > > >>>>>>>> On 9 March 2016 at 15:51, Gábor Horváth <
>> > > [hidden email]>
>> > > > > > >>>> wrote:
>> > > > > > >>>>>>>>
>> > > > > > >>>>>>>>> Hi,I did not want to send this proposal out before
>> the I
>> > > have
>> > > > > > some
>> > > > > > >>>>>>>>> initial benchmarks, but this issue was mentioned on
>> the
>> > > > mailing
>> > > > > > >>>> list
>> > > > > > >>>>>> (
>> > > > > > >>>>>>>>>
>> > > > > > >>>>>>>
>> > > > > > >>>>>>
>> > > > > > >>>>
>> > > > > > >>
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Tuple-performance-and-the-curious-JIT-compiler-td10666.html
>> > > > > > >>>>>>> ),
>> > > > > > >>>>>>>>> and I wanted to make this information available to be
>> > able
>> > > to
>> > > > > > >>>>>>> incorporate
>> > > > > > >>>>>>>>> this into that discussion. I have written this draft
>> with
>> > > the
>> > > > > > help
>> > > > > > >>>> of
>> > > > > > >>>>>>> Gábor
>> > > > > > >>>>>>>>> Gévay and Márton Balassi and I am open to every
>> > suggestion.
>> > > > > > >>>>>>>>>
>> > > > > > >>>>>>>>>
>> > > > > > >>>>>>>>> The proposal draft:
>> > > > > > >>>>>>>>> Code Generation in Serializers and Comparators of
>> Apache
>> > > > Flink
>> > > > > > >>>>>>>>>
>> > > > > > >>>>>>>>> I am doing my last semester of my MSc studies and I’m
>> a
>> > > > former
>> > > > > > GSoC
>> > > > > > >>>>>>>>> student in the LLVM project. I plan to improve the
>> > > > > serialization
>> > > > > > >>>>>> code in
>> > > > > > >>>>>>>>> Flink during this summer. The current implementation
>> of
>> > the
>> > > > > > >>>>>> serializers
>> > > > > > >>>>>>> can
>> > > > > > >>>>>>>>> be a performance bottleneck in some scenarios. These
>> > > > > performance
>> > > > > > >>>>>>> problems
>> > > > > > >>>>>>>>> were also reported on the mailing list recently [1]. I
>> > plan
>> > > > to
>> > > > > > >>>>>> implement
>> > > > > > >>>>>>>>> code generation into the serializers to improve the
>> > > > performance
>> > > > > > (as
>> > > > > > >>>>>>> Stephan
>> > > > > > >>>>>>>>> Ewen also suggested.)
>> > > > > > >>>>>>>>>
>> > > > > > >>>>>>>>> TODO: I plan to include some preliminary benchmarks in
>> > this
>> > > > > > >>>> section.
>> > > > > > >>>>>>>>> Performance problems with the current serializers
>> > > > > > >>>>>>>>>
>> > > > > > >>>>>>>>>  1.
>> > > > > > >>>>>>>>>
>> > > > > > >>>>>>>>>  PojoSerializer uses reflection for accessing the
>> fields,
>> > > > which
>> > > > > > >>>> is
>> > > > > > >>>>>>>>>  slow (eg. [2])
>> > > > > > >>>>>>>>>
>> > > > > > >>>>>>>>>
>> > > > > > >>>>>>>>>  -
>> > > > > > >>>>>>>>>
Reply | Threaded
Open this post in threaded view
|

Re: GSoC Project Proposal Draft: Code Generation in Serializers

Gábor Horváth
Hi!

I would like to give you some status updates on the Google Summer of Code
project. I started to implement the proposed features [1].

Status of code generation in general:
* I can compile the generated code using the Janino compiler (see the sketch
below).
* I can load the compiled classes and use them.
* For some mysterious reason, the readObject method is not invoked when
deserializing a Janino-compiled object. When the same code is compiled with
another compiler, it works as intended. I am investigating this issue; if any
of you have an idea what the problem might be, don't keep it secret :) While
I am trying to solve it, I continue to work on code generation, since I can
still test the generated code by registering it manually.
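
For reference, here is a minimal, self-contained sketch of the compile-and-load
step with Janino's SimpleCompiler. The class name and the source string are
placeholders for illustration only, not the code that the generator actually
produces:

import org.codehaus.janino.SimpleCompiler;

public class GeneratedSerializerLoader {

    public static void main(String[] args) throws Exception {
        // Placeholder source; in the real project this string would be
        // produced by the code generator from the type information.
        String source =
            "public class GeneratedPojoSerializer {\n"
          + "    public int serializeField(int value) {\n"
          + "        return value; // stand-in for the generated field logic\n"
          + "    }\n"
          + "}\n";

        // Compile the generated source in memory.
        SimpleCompiler compiler = new SimpleCompiler();
        compiler.cook(source);

        // Load the freshly compiled class and instantiate it.
        Class<?> clazz = compiler.getClassLoader().loadClass("GeneratedPojoSerializer");
        Object generated = clazz.getDeclaredConstructor().newInstance();
        System.out.println("Loaded: " + generated.getClass().getName());
    }
}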

Status of the generated POJO serializers:
* I could use the generated code on the WordCountPojo example.
* Everything is implemented except for copying stateful serializers and
serializing subclasses.
* There are several potential performance advantages of the generated
serializers (an illustrative sketch follows this list):
  - The serialization/deserialization of the fields is not done in a loop,
giving the JVM a better chance to inline and devirtualize the calls.
  - Null checks are eliminated for primitive types.
  - Subclass checks are eliminated for final classes.
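
To make the unrolling concrete, here is an illustrative sketch (not the code
the generator currently emits) of what could be generated for a hypothetical
final POJO with a String and an int field. It is reduced to the
serialize/deserialize pair rather than a full TypeSerializer subclass, and it
assumes Flink's DataOutputView/DataInputView, which extend
java.io.DataOutput/DataInput:

import java.io.IOException;

import org.apache.flink.core.memory.DataInputView;
import org.apache.flink.core.memory.DataOutputView;

// Hypothetical user type: final, so no subclass tag is needed.
final class WordWithCount {
    public String word;
    public int count;
}

// The field accesses are unrolled (no loop over field serializers), and the
// primitive field needs neither boxing nor a null tag.
class GeneratedWordWithCountSerializer {

    public void serialize(WordWithCount record, DataOutputView target) throws IOException {
        // Reference field: a single null tag, because a String can be null.
        if (record.word == null) {
            target.writeBoolean(true);
        } else {
            target.writeBoolean(false);
            target.writeUTF(record.word);
        }
        // Primitive field: written directly, no null tag, no boxing.
        target.writeInt(record.count);
    }

    public WordWithCount deserialize(WordWithCount reuse, DataInputView source) throws IOException {
        reuse.word = source.readBoolean() ? null : source.readUTF();
        reuse.count = source.readInt();
        return reuse;
    }
}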

Status of the generated POJO comparators:
* I have started to implement them (see the sketch below).
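
The direction is similar to the serializer sketch above: an unrolled,
field-by-field compare. Again, this is only an illustration for the
hypothetical WordWithCount type, not code from the branch (ascending order on
both fields, nulls ordered before non-null Strings):

// Illustrative only: what an unrolled generated compare could look like.
class GeneratedWordWithCountComparator {

    public int compare(WordWithCount first, WordWithCount second) {
        // String field, compared directly instead of via a field comparator.
        if (first.word == null || second.word == null) {
            if (first.word != null) {
                return 1;   // non-null sorts after null
            }
            if (second.word != null) {
                return -1;  // null sorts before non-null
            }
            // both null: fall through to the next field
        } else {
            int cmp = first.word.compareTo(second.word);
            if (cmp != 0) {
                return cmp;
            }
        }
        // Primitive field, compared without boxing.
        return Integer.compare(first.count, second.count);
    }
}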

I did some preliminary benchmarks with the generated code using the
WordCountPojo example.
* In the baseline (using the default Flink serializers),
PojoSerializer.deserialize was one of the hottest methods, accounting for
over 11 percent of the profiler samples.
* With the generated serializers, the share of samples attributed to the
deserialize method dropped below 3 percent.
* A very significant amount of time is spent in the comparators, so there is
potential for performance gains there as well.

What's next? I am trying to solve the problem with the readObject method,
and in the meantime I am trying to get the generated comparators working on
the WordCountPojo example. Once that is done, I will make a more detailed
performance case study. After that, I will add support for handling
subclasses in the generated code. At that point the generated code will have
all the required features.

Note: I did not change the serialization format, so the generated code can
interoperate with the default serializers. This is crucial for backward
compatibility with savepoints.

Regards,
Gábor

[1] https://github.com/Xazax-hun/flink/commits/serializer_codegen
