Handling custom types with Kryo

classic Classic list List threaded Threaded
11 messages Options
Reply | Threaded
Open this post in threaded view
|

Handling custom types with Kryo

Stephan Ewen
Hi all!

We have various pending pull requests that add support for certain types by
adding extra kryo serializers.

I think we need to decide how we want to handle the support for extra
types, because more are certainly to come.

As I understand it, we have three broad options:

(1)
Add as many serializers to Kryo by default as possible.
 Pro:
    - Many types work out of the box
 Contra:
    - We may eventually overload the kryo registry with serializers
      that are not needed for most cases and suffer in performance
    - It is hard to guess which types work out of the box (intransparent)


(2)
We create a collection of serializers and a registration util.
--------
val env = ExecutionEnvironemnt.getExecutionEnviroment()

Serializers.registerProtoBufSerializers(env);
Serializers.registerJavaUtilSerializers(env);
---------
Pro:
  - Easy for users
  - We can grow the set of supported types very large without overloading
Kryo
  - It is transparent what gets registered

Contra:
  - Not quite as convenient as if things just run


(3)
We do nothing and let the user create and register whatever is needed.

We could have a library and utility for serializers for certain libraries.
Users could use this to conveniently add serializers for the libraries they
use.
Pro:
  - Simple for us ;-)
Contra:
  - More repeated work for users


========================

For approach (1) and (2), there is an orthogonal question of whether we
want to simply register default serializers (that enable that types work)
or also register types for tags, to speed up the serialization of those
types.


Greetings,
Stephan
Reply | Threaded
Open this post in threaded view
|

Re: Handling custom types with Kryo

Robert Metzger
Hi,

thank you for putting our discussion to the mailing list. This is indeed
where such discussions belong. For the others, we started discussing here:
https://github.com/apache/flink/pull/304

I think there is one additional approach, which is probably close to (1):
We only register those serializers by default which we don't see in the
pre-flight phase (I think right now thats only GenericData.Array from
Avro).
We would come across all the other classes (Jodatime, Protobuf, Avro,
Thrift, ...) when traversing the class hierarchy, as proposed in
FLINK-1417. With this approach, users get the best out-of-the box
experience and the number of registered classes / serializers is kept at a
minimum.
We can still offer means to register additional serializers (I think thats
already merged to master).

My main concern with this particular issue is a good out of the box user
experience. If there is an issue with type serialization, users will notice
it very early. (In my experience people often have their existing datatypes
they use with other systems, and they want to continue using them)
Therefore, I want to put some effort into making it as good as possible. I
would actually sacrifice performance over stability/usability here. Our
system is flexible enough to replace it later with a more efficient
serialization if that becomes an issue. But maybe my suggestion above is
already sufficient.

We could also think about introducing a configuration variable which allows
users to disable the default serializers.


Regarding the second question: Is there a downside registering all types
for tagging? We reduce the overall I/O which is good for performance.

Best,
Robert



On Mon, Jan 19, 2015 at 8:24 PM, Stephan Ewen <[hidden email]> wrote:

> Hi all!
>
> We have various pending pull requests that add support for certain types by
> adding extra kryo serializers.
>
> I think we need to decide how we want to handle the support for extra
> types, because more are certainly to come.
>
> As I understand it, we have three broad options:
>
> (1)
> Add as many serializers to Kryo by default as possible.
>  Pro:
>     - Many types work out of the box
>  Contra:
>     - We may eventually overload the kryo registry with serializers
>       that are not needed for most cases and suffer in performance
>     - It is hard to guess which types work out of the box (intransparent)
>
>
> (2)
> We create a collection of serializers and a registration util.
> --------
> val env = ExecutionEnvironemnt.getExecutionEnviroment()
>
> Serializers.registerProtoBufSerializers(env);
> Serializers.registerJavaUtilSerializers(env);
> ---------
> Pro:
>   - Easy for users
>   - We can grow the set of supported types very large without overloading
> Kryo
>   - It is transparent what gets registered
>
> Contra:
>   - Not quite as convenient as if things just run
>
>
> (3)
> We do nothing and let the user create and register whatever is needed.
>
> We could have a library and utility for serializers for certain libraries.
> Users could use this to conveniently add serializers for the libraries they
> use.
> Pro:
>   - Simple for us ;-)
> Contra:
>   - More repeated work for users
>
>
> ========================
>
> For approach (1) and (2), there is an orthogonal question of whether we
> want to simply register default serializers (that enable that types work)
> or also register types for tags, to speed up the serialization of those
> types.
>
>
> Greetings,
> Stephan
>
Reply | Threaded
Open this post in threaded view
|

Re: Handling custom types with Kryo

Stephan Ewen
Yes, I agree that the Avro serializer should be available by default. That
is one case of a typical type that should work out of the box, given that
we support Avro file formats.

Let me summarize how I understood that suggestion:

 - We make Avro available by default by registering a default serializer
for the SpecificBase

 - We create a library of serializers. We do not register them by default.

 - Via FLINK-1417, we analyze the types. For any (nested) type that we
encounter for which we have a serializer in the library, we register that
serializer as the default serializer. Also, for every (nested) type we
encounter, we register a tag at Kryo.

I like that, it should give a nice and smooth user experience.

Greetings,
Stephan




On Mon, Jan 19, 2015 at 12:32 PM, Robert Metzger <[hidden email]>
wrote:

> Hi,
>
> thank you for putting our discussion to the mailing list. This is indeed
> where such discussions belong. For the others, we started discussing here:
> https://github.com/apache/flink/pull/304
>
> I think there is one additional approach, which is probably close to (1):
> We only register those serializers by default which we don't see in the
> pre-flight phase (I think right now thats only GenericData.Array from
> Avro).
> We would come across all the other classes (Jodatime, Protobuf, Avro,
> Thrift, ...) when traversing the class hierarchy, as proposed in
> FLINK-1417. With this approach, users get the best out-of-the box
> experience and the number of registered classes / serializers is kept at a
> minimum.
> We can still offer means to register additional serializers (I think thats
> already merged to master).
>
> My main concern with this particular issue is a good out of the box user
> experience. If there is an issue with type serialization, users will notice
> it very early. (In my experience people often have their existing datatypes
> they use with other systems, and they want to continue using them)
> Therefore, I want to put some effort into making it as good as possible. I
> would actually sacrifice performance over stability/usability here. Our
> system is flexible enough to replace it later with a more efficient
> serialization if that becomes an issue. But maybe my suggestion above is
> already sufficient.
>
> We could also think about introducing a configuration variable which allows
> users to disable the default serializers.
>
>
> Regarding the second question: Is there a downside registering all types
> for tagging? We reduce the overall I/O which is good for performance.
>
> Best,
> Robert
>
>
>
> On Mon, Jan 19, 2015 at 8:24 PM, Stephan Ewen <[hidden email]> wrote:
>
> > Hi all!
> >
> > We have various pending pull requests that add support for certain types
> by
> > adding extra kryo serializers.
> >
> > I think we need to decide how we want to handle the support for extra
> > types, because more are certainly to come.
> >
> > As I understand it, we have three broad options:
> >
> > (1)
> > Add as many serializers to Kryo by default as possible.
> >  Pro:
> >     - Many types work out of the box
> >  Contra:
> >     - We may eventually overload the kryo registry with serializers
> >       that are not needed for most cases and suffer in performance
> >     - It is hard to guess which types work out of the box (intransparent)
> >
> >
> > (2)
> > We create a collection of serializers and a registration util.
> > --------
> > val env = ExecutionEnvironemnt.getExecutionEnviroment()
> >
> > Serializers.registerProtoBufSerializers(env);
> > Serializers.registerJavaUtilSerializers(env);
> > ---------
> > Pro:
> >   - Easy for users
> >   - We can grow the set of supported types very large without overloading
> > Kryo
> >   - It is transparent what gets registered
> >
> > Contra:
> >   - Not quite as convenient as if things just run
> >
> >
> > (3)
> > We do nothing and let the user create and register whatever is needed.
> >
> > We could have a library and utility for serializers for certain
> libraries.
> > Users could use this to conveniently add serializers for the libraries
> they
> > use.
> > Pro:
> >   - Simple for us ;-)
> > Contra:
> >   - More repeated work for users
> >
> >
> > ========================
> >
> > For approach (1) and (2), there is an orthogonal question of whether we
> > want to simply register default serializers (that enable that types work)
> > or also register types for tags, to speed up the serialization of those
> > types.
> >
> >
> > Greetings,
> > Stephan
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: Handling custom types with Kryo

Arvid Heise-2
An alternative way that may not be applicable in your case:

For Sopremo, all types implemented a common interface. When a package is
loaded, the Sopremo package manager scans the jar and looks for classes
implementing the interfaces (quite fast, because not the entire class must
be loaded). All types implementing the interface are automatically added to
Kryo with their respective annotated serializers.

If you still have bytecode analysis of jobs, you can also statically
determine all types that are used, check for default serializers, and
maintain only a minimal set of serializers used for this specific job. You
could already warn for unregistered types before submitting jobs.

On Tue, Jan 20, 2015 at 6:38 AM, Stephan Ewen <[hidden email]> wrote:

> Yes, I agree that the Avro serializer should be available by default. That
> is one case of a typical type that should work out of the box, given that
> we support Avro file formats.
>
> Let me summarize how I understood that suggestion:
>
>  - We make Avro available by default by registering a default serializer
> for the SpecificBase
>
>  - We create a library of serializers. We do not register them by default.
>
>  - Via FLINK-1417, we analyze the types. For any (nested) type that we
> encounter for which we have a serializer in the library, we register that
> serializer as the default serializer. Also, for every (nested) type we
> encounter, we register a tag at Kryo.
>
> I like that, it should give a nice and smooth user experience.
>
> Greetings,
> Stephan
>
>
>
>
> On Mon, Jan 19, 2015 at 12:32 PM, Robert Metzger <[hidden email]>
> wrote:
>
> > Hi,
> >
> > thank you for putting our discussion to the mailing list. This is indeed
> > where such discussions belong. For the others, we started discussing
> here:
> > https://github.com/apache/flink/pull/304
> >
> > I think there is one additional approach, which is probably close to (1):
> > We only register those serializers by default which we don't see in the
> > pre-flight phase (I think right now thats only GenericData.Array from
> > Avro).
> > We would come across all the other classes (Jodatime, Protobuf, Avro,
> > Thrift, ...) when traversing the class hierarchy, as proposed in
> > FLINK-1417. With this approach, users get the best out-of-the box
> > experience and the number of registered classes / serializers is kept at
> a
> > minimum.
> > We can still offer means to register additional serializers (I think
> thats
> > already merged to master).
> >
> > My main concern with this particular issue is a good out of the box user
> > experience. If there is an issue with type serialization, users will
> notice
> > it very early. (In my experience people often have their existing
> datatypes
> > they use with other systems, and they want to continue using them)
> > Therefore, I want to put some effort into making it as good as possible.
> I
> > would actually sacrifice performance over stability/usability here. Our
> > system is flexible enough to replace it later with a more efficient
> > serialization if that becomes an issue. But maybe my suggestion above is
> > already sufficient.
> >
> > We could also think about introducing a configuration variable which
> allows
> > users to disable the default serializers.
> >
> >
> > Regarding the second question: Is there a downside registering all types
> > for tagging? We reduce the overall I/O which is good for performance.
> >
> > Best,
> > Robert
> >
> >
> >
> > On Mon, Jan 19, 2015 at 8:24 PM, Stephan Ewen <[hidden email]> wrote:
> >
> > > Hi all!
> > >
> > > We have various pending pull requests that add support for certain
> types
> > by
> > > adding extra kryo serializers.
> > >
> > > I think we need to decide how we want to handle the support for extra
> > > types, because more are certainly to come.
> > >
> > > As I understand it, we have three broad options:
> > >
> > > (1)
> > > Add as many serializers to Kryo by default as possible.
> > >  Pro:
> > >     - Many types work out of the box
> > >  Contra:
> > >     - We may eventually overload the kryo registry with serializers
> > >       that are not needed for most cases and suffer in performance
> > >     - It is hard to guess which types work out of the box
> (intransparent)
> > >
> > >
> > > (2)
> > > We create a collection of serializers and a registration util.
> > > --------
> > > val env = ExecutionEnvironemnt.getExecutionEnviroment()
> > >
> > > Serializers.registerProtoBufSerializers(env);
> > > Serializers.registerJavaUtilSerializers(env);
> > > ---------
> > > Pro:
> > >   - Easy for users
> > >   - We can grow the set of supported types very large without
> overloading
> > > Kryo
> > >   - It is transparent what gets registered
> > >
> > > Contra:
> > >   - Not quite as convenient as if things just run
> > >
> > >
> > > (3)
> > > We do nothing and let the user create and register whatever is needed.
> > >
> > > We could have a library and utility for serializers for certain
> > libraries.
> > > Users could use this to conveniently add serializers for the libraries
> > they
> > > use.
> > > Pro:
> > >   - Simple for us ;-)
> > > Contra:
> > >   - More repeated work for users
> > >
> > >
> > > ========================
> > >
> > > For approach (1) and (2), there is an orthogonal question of whether we
> > > want to simply register default serializers (that enable that types
> work)
> > > or also register types for tags, to speed up the serialization of those
> > > types.
> > >
> > >
> > > Greetings,
> > > Stephan
> > >
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: Handling custom types with Kryo

Robert Metzger
@Stephan: Yes, you are summarizing it correctly.
I'll assign FLINK-1417 to myself and implement it as discussed here (once I
have resolved the other issues assigned to me)

There is one additional point we forgot in the discussion so far: We are
initializing Kryo with twitter/chill's "ScalaKryoInstantiator()". I just
checked and its registering 51 default serializers and 81 registered types.

I just talked to Till (who implemented the KryoSerializer originally) and
he also suggests to pool kryo instances using thread-local variables. I'll
look into that as well once FLINK-1417 has been resolved. I think that
helps to mitigate the heavy initialization of Kryo (which is inevitable)


@Arvid: Our POJO/Types support does explicitly not require our users to
implement any interfaces, so that option is not feasible.
The bytecode analysis is not part of the main flink distribution because it
is using a library with an apache incompatible license.



On Tue, Jan 20, 2015 at 9:58 AM, Arvid Heise <[hidden email]> wrote:

> An alternative way that may not be applicable in your case:
>
> For Sopremo, all types implemented a common interface. When a package is
> loaded, the Sopremo package manager scans the jar and looks for classes
> implementing the interfaces (quite fast, because not the entire class must
> be loaded). All types implementing the interface are automatically added to
> Kryo with their respective annotated serializers.
>
> If you still have bytecode analysis of jobs, you can also statically
> determine all types that are used, check for default serializers, and
> maintain only a minimal set of serializers used for this specific job. You
> could already warn for unregistered types before submitting jobs.
>
> On Tue, Jan 20, 2015 at 6:38 AM, Stephan Ewen <[hidden email]> wrote:
>
> > Yes, I agree that the Avro serializer should be available by default.
> That
> > is one case of a typical type that should work out of the box, given that
> > we support Avro file formats.
> >
> > Let me summarize how I understood that suggestion:
> >
> >  - We make Avro available by default by registering a default serializer
> > for the SpecificBase
> >
> >  - We create a library of serializers. We do not register them by
> default.
> >
> >  - Via FLINK-1417, we analyze the types. For any (nested) type that we
> > encounter for which we have a serializer in the library, we register that
> > serializer as the default serializer. Also, for every (nested) type we
> > encounter, we register a tag at Kryo.
> >
> > I like that, it should give a nice and smooth user experience.
> >
> > Greetings,
> > Stephan
> >
> >
> >
> >
> > On Mon, Jan 19, 2015 at 12:32 PM, Robert Metzger <[hidden email]>
> > wrote:
> >
> > > Hi,
> > >
> > > thank you for putting our discussion to the mailing list. This is
> indeed
> > > where such discussions belong. For the others, we started discussing
> > here:
> > > https://github.com/apache/flink/pull/304
> > >
> > > I think there is one additional approach, which is probably close to
> (1):
> > > We only register those serializers by default which we don't see in the
> > > pre-flight phase (I think right now thats only GenericData.Array from
> > > Avro).
> > > We would come across all the other classes (Jodatime, Protobuf, Avro,
> > > Thrift, ...) when traversing the class hierarchy, as proposed in
> > > FLINK-1417. With this approach, users get the best out-of-the box
> > > experience and the number of registered classes / serializers is kept
> at
> > a
> > > minimum.
> > > We can still offer means to register additional serializers (I think
> > thats
> > > already merged to master).
> > >
> > > My main concern with this particular issue is a good out of the box
> user
> > > experience. If there is an issue with type serialization, users will
> > notice
> > > it very early. (In my experience people often have their existing
> > datatypes
> > > they use with other systems, and they want to continue using them)
> > > Therefore, I want to put some effort into making it as good as
> possible.
> > I
> > > would actually sacrifice performance over stability/usability here. Our
> > > system is flexible enough to replace it later with a more efficient
> > > serialization if that becomes an issue. But maybe my suggestion above
> is
> > > already sufficient.
> > >
> > > We could also think about introducing a configuration variable which
> > allows
> > > users to disable the default serializers.
> > >
> > >
> > > Regarding the second question: Is there a downside registering all
> types
> > > for tagging? We reduce the overall I/O which is good for performance.
> > >
> > > Best,
> > > Robert
> > >
> > >
> > >
> > > On Mon, Jan 19, 2015 at 8:24 PM, Stephan Ewen <[hidden email]>
> wrote:
> > >
> > > > Hi all!
> > > >
> > > > We have various pending pull requests that add support for certain
> > types
> > > by
> > > > adding extra kryo serializers.
> > > >
> > > > I think we need to decide how we want to handle the support for extra
> > > > types, because more are certainly to come.
> > > >
> > > > As I understand it, we have three broad options:
> > > >
> > > > (1)
> > > > Add as many serializers to Kryo by default as possible.
> > > >  Pro:
> > > >     - Many types work out of the box
> > > >  Contra:
> > > >     - We may eventually overload the kryo registry with serializers
> > > >       that are not needed for most cases and suffer in performance
> > > >     - It is hard to guess which types work out of the box
> > (intransparent)
> > > >
> > > >
> > > > (2)
> > > > We create a collection of serializers and a registration util.
> > > > --------
> > > > val env = ExecutionEnvironemnt.getExecutionEnviroment()
> > > >
> > > > Serializers.registerProtoBufSerializers(env);
> > > > Serializers.registerJavaUtilSerializers(env);
> > > > ---------
> > > > Pro:
> > > >   - Easy for users
> > > >   - We can grow the set of supported types very large without
> > overloading
> > > > Kryo
> > > >   - It is transparent what gets registered
> > > >
> > > > Contra:
> > > >   - Not quite as convenient as if things just run
> > > >
> > > >
> > > > (3)
> > > > We do nothing and let the user create and register whatever is
> needed.
> > > >
> > > > We could have a library and utility for serializers for certain
> > > libraries.
> > > > Users could use this to conveniently add serializers for the
> libraries
> > > they
> > > > use.
> > > > Pro:
> > > >   - Simple for us ;-)
> > > > Contra:
> > > >   - More repeated work for users
> > > >
> > > >
> > > > ========================
> > > >
> > > > For approach (1) and (2), there is an orthogonal question of whether
> we
> > > > want to simply register default serializers (that enable that types
> > work)
> > > > or also register types for tags, to speed up the serialization of
> those
> > > > types.
> > > >
> > > >
> > > > Greetings,
> > > > Stephan
> > > >
> > >
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: Handling custom types with Kryo

till.rohrmann
I like the idea to automatically figure out which types are used by a
program and to register them at Kryo. Thus, +1 for this idea.

On Tue, Jan 20, 2015 at 11:34 AM, Robert Metzger <[hidden email]>
wrote:

> @Stephan: Yes, you are summarizing it correctly.
> I'll assign FLINK-1417 to myself and implement it as discussed here (once I
> have resolved the other issues assigned to me)
>
> There is one additional point we forgot in the discussion so far: We are
> initializing Kryo with twitter/chill's "ScalaKryoInstantiator()". I just
> checked and its registering 51 default serializers and 81 registered types.
>
> I just talked to Till (who implemented the KryoSerializer originally) and
> he also suggests to pool kryo instances using thread-local variables. I'll
> look into that as well once FLINK-1417 has been resolved. I think that
> helps to mitigate the heavy initialization of Kryo (which is inevitable)
>
>
> @Arvid: Our POJO/Types support does explicitly not require our users to
> implement any interfaces, so that option is not feasible.
> The bytecode analysis is not part of the main flink distribution because it
> is using a library with an apache incompatible license.
>
>
>
> On Tue, Jan 20, 2015 at 9:58 AM, Arvid Heise <[hidden email]>
> wrote:
>
> > An alternative way that may not be applicable in your case:
> >
> > For Sopremo, all types implemented a common interface. When a package is
> > loaded, the Sopremo package manager scans the jar and looks for classes
> > implementing the interfaces (quite fast, because not the entire class
> must
> > be loaded). All types implementing the interface are automatically added
> to
> > Kryo with their respective annotated serializers.
> >
> > If you still have bytecode analysis of jobs, you can also statically
> > determine all types that are used, check for default serializers, and
> > maintain only a minimal set of serializers used for this specific job.
> You
> > could already warn for unregistered types before submitting jobs.
> >
> > On Tue, Jan 20, 2015 at 6:38 AM, Stephan Ewen <[hidden email]> wrote:
> >
> > > Yes, I agree that the Avro serializer should be available by default.
> > That
> > > is one case of a typical type that should work out of the box, given
> that
> > > we support Avro file formats.
> > >
> > > Let me summarize how I understood that suggestion:
> > >
> > >  - We make Avro available by default by registering a default
> serializer
> > > for the SpecificBase
> > >
> > >  - We create a library of serializers. We do not register them by
> > default.
> > >
> > >  - Via FLINK-1417, we analyze the types. For any (nested) type that we
> > > encounter for which we have a serializer in the library, we register
> that
> > > serializer as the default serializer. Also, for every (nested) type we
> > > encounter, we register a tag at Kryo.
> > >
> > > I like that, it should give a nice and smooth user experience.
> > >
> > > Greetings,
> > > Stephan
> > >
> > >
> > >
> > >
> > > On Mon, Jan 19, 2015 at 12:32 PM, Robert Metzger <[hidden email]>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > thank you for putting our discussion to the mailing list. This is
> > indeed
> > > > where such discussions belong. For the others, we started discussing
> > > here:
> > > > https://github.com/apache/flink/pull/304
> > > >
> > > > I think there is one additional approach, which is probably close to
> > (1):
> > > > We only register those serializers by default which we don't see in
> the
> > > > pre-flight phase (I think right now thats only GenericData.Array from
> > > > Avro).
> > > > We would come across all the other classes (Jodatime, Protobuf, Avro,
> > > > Thrift, ...) when traversing the class hierarchy, as proposed in
> > > > FLINK-1417. With this approach, users get the best out-of-the box
> > > > experience and the number of registered classes / serializers is kept
> > at
> > > a
> > > > minimum.
> > > > We can still offer means to register additional serializers (I think
> > > thats
> > > > already merged to master).
> > > >
> > > > My main concern with this particular issue is a good out of the box
> > user
> > > > experience. If there is an issue with type serialization, users will
> > > notice
> > > > it very early. (In my experience people often have their existing
> > > datatypes
> > > > they use with other systems, and they want to continue using them)
> > > > Therefore, I want to put some effort into making it as good as
> > possible.
> > > I
> > > > would actually sacrifice performance over stability/usability here.
> Our
> > > > system is flexible enough to replace it later with a more efficient
> > > > serialization if that becomes an issue. But maybe my suggestion above
> > is
> > > > already sufficient.
> > > >
> > > > We could also think about introducing a configuration variable which
> > > allows
> > > > users to disable the default serializers.
> > > >
> > > >
> > > > Regarding the second question: Is there a downside registering all
> > types
> > > > for tagging? We reduce the overall I/O which is good for performance.
> > > >
> > > > Best,
> > > > Robert
> > > >
> > > >
> > > >
> > > > On Mon, Jan 19, 2015 at 8:24 PM, Stephan Ewen <[hidden email]>
> > wrote:
> > > >
> > > > > Hi all!
> > > > >
> > > > > We have various pending pull requests that add support for certain
> > > types
> > > > by
> > > > > adding extra kryo serializers.
> > > > >
> > > > > I think we need to decide how we want to handle the support for
> extra
> > > > > types, because more are certainly to come.
> > > > >
> > > > > As I understand it, we have three broad options:
> > > > >
> > > > > (1)
> > > > > Add as many serializers to Kryo by default as possible.
> > > > >  Pro:
> > > > >     - Many types work out of the box
> > > > >  Contra:
> > > > >     - We may eventually overload the kryo registry with serializers
> > > > >       that are not needed for most cases and suffer in performance
> > > > >     - It is hard to guess which types work out of the box
> > > (intransparent)
> > > > >
> > > > >
> > > > > (2)
> > > > > We create a collection of serializers and a registration util.
> > > > > --------
> > > > > val env = ExecutionEnvironemnt.getExecutionEnviroment()
> > > > >
> > > > > Serializers.registerProtoBufSerializers(env);
> > > > > Serializers.registerJavaUtilSerializers(env);
> > > > > ---------
> > > > > Pro:
> > > > >   - Easy for users
> > > > >   - We can grow the set of supported types very large without
> > > overloading
> > > > > Kryo
> > > > >   - It is transparent what gets registered
> > > > >
> > > > > Contra:
> > > > >   - Not quite as convenient as if things just run
> > > > >
> > > > >
> > > > > (3)
> > > > > We do nothing and let the user create and register whatever is
> > needed.
> > > > >
> > > > > We could have a library and utility for serializers for certain
> > > > libraries.
> > > > > Users could use this to conveniently add serializers for the
> > libraries
> > > > they
> > > > > use.
> > > > > Pro:
> > > > >   - Simple for us ;-)
> > > > > Contra:
> > > > >   - More repeated work for users
> > > > >
> > > > >
> > > > > ========================
> > > > >
> > > > > For approach (1) and (2), there is an orthogonal question of
> whether
> > we
> > > > > want to simply register default serializers (that enable that types
> > > work)
> > > > > or also register types for tags, to speed up the serialization of
> > those
> > > > > types.
> > > > >
> > > > >
> > > > > Greetings,
> > > > > Stephan
> > > > >
> > > >
> > >
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: Handling custom types with Kryo

aalexandrov
+1 for program analysis from me too...

Should be doable also on a lower level (e.g. analysis of compiled *.class
files) with some off-the-shelf libraries, right?

2015-01-20 11:39 GMT+01:00 Till Rohrmann <[hidden email]>:

> I like the idea to automatically figure out which types are used by a
> program and to register them at Kryo. Thus, +1 for this idea.
>
> On Tue, Jan 20, 2015 at 11:34 AM, Robert Metzger <[hidden email]>
> wrote:
>
> > @Stephan: Yes, you are summarizing it correctly.
> > I'll assign FLINK-1417 to myself and implement it as discussed here
> (once I
> > have resolved the other issues assigned to me)
> >
> > There is one additional point we forgot in the discussion so far: We are
> > initializing Kryo with twitter/chill's "ScalaKryoInstantiator()". I just
> > checked and its registering 51 default serializers and 81 registered
> types.
> >
> > I just talked to Till (who implemented the KryoSerializer originally) and
> > he also suggests to pool kryo instances using thread-local variables.
> I'll
> > look into that as well once FLINK-1417 has been resolved. I think that
> > helps to mitigate the heavy initialization of Kryo (which is inevitable)
> >
> >
> > @Arvid: Our POJO/Types support does explicitly not require our users to
> > implement any interfaces, so that option is not feasible.
> > The bytecode analysis is not part of the main flink distribution because
> it
> > is using a library with an apache incompatible license.
> >
> >
> >
> > On Tue, Jan 20, 2015 at 9:58 AM, Arvid Heise <[hidden email]>
> > wrote:
> >
> > > An alternative way that may not be applicable in your case:
> > >
> > > For Sopremo, all types implemented a common interface. When a package
> is
> > > loaded, the Sopremo package manager scans the jar and looks for classes
> > > implementing the interfaces (quite fast, because not the entire class
> > must
> > > be loaded). All types implementing the interface are automatically
> added
> > to
> > > Kryo with their respective annotated serializers.
> > >
> > > If you still have bytecode analysis of jobs, you can also statically
> > > determine all types that are used, check for default serializers, and
> > > maintain only a minimal set of serializers used for this specific job.
> > You
> > > could already warn for unregistered types before submitting jobs.
> > >
> > > On Tue, Jan 20, 2015 at 6:38 AM, Stephan Ewen <[hidden email]>
> wrote:
> > >
> > > > Yes, I agree that the Avro serializer should be available by default.
> > > That
> > > > is one case of a typical type that should work out of the box, given
> > that
> > > > we support Avro file formats.
> > > >
> > > > Let me summarize how I understood that suggestion:
> > > >
> > > >  - We make Avro available by default by registering a default
> > serializer
> > > > for the SpecificBase
> > > >
> > > >  - We create a library of serializers. We do not register them by
> > > default.
> > > >
> > > >  - Via FLINK-1417, we analyze the types. For any (nested) type that
> we
> > > > encounter for which we have a serializer in the library, we register
> > that
> > > > serializer as the default serializer. Also, for every (nested) type
> we
> > > > encounter, we register a tag at Kryo.
> > > >
> > > > I like that, it should give a nice and smooth user experience.
> > > >
> > > > Greetings,
> > > > Stephan
> > > >
> > > >
> > > >
> > > >
> > > > On Mon, Jan 19, 2015 at 12:32 PM, Robert Metzger <
> [hidden email]>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > thank you for putting our discussion to the mailing list. This is
> > > indeed
> > > > > where such discussions belong. For the others, we started
> discussing
> > > > here:
> > > > > https://github.com/apache/flink/pull/304
> > > > >
> > > > > I think there is one additional approach, which is probably close
> to
> > > (1):
> > > > > We only register those serializers by default which we don't see in
> > the
> > > > > pre-flight phase (I think right now thats only GenericData.Array
> from
> > > > > Avro).
> > > > > We would come across all the other classes (Jodatime, Protobuf,
> Avro,
> > > > > Thrift, ...) when traversing the class hierarchy, as proposed in
> > > > > FLINK-1417. With this approach, users get the best out-of-the box
> > > > > experience and the number of registered classes / serializers is
> kept
> > > at
> > > > a
> > > > > minimum.
> > > > > We can still offer means to register additional serializers (I
> think
> > > > thats
> > > > > already merged to master).
> > > > >
> > > > > My main concern with this particular issue is a good out of the box
> > > user
> > > > > experience. If there is an issue with type serialization, users
> will
> > > > notice
> > > > > it very early. (In my experience people often have their existing
> > > > datatypes
> > > > > they use with other systems, and they want to continue using them)
> > > > > Therefore, I want to put some effort into making it as good as
> > > possible.
> > > > I
> > > > > would actually sacrifice performance over stability/usability here.
> > Our
> > > > > system is flexible enough to replace it later with a more efficient
> > > > > serialization if that becomes an issue. But maybe my suggestion
> above
> > > is
> > > > > already sufficient.
> > > > >
> > > > > We could also think about introducing a configuration variable
> which
> > > > allows
> > > > > users to disable the default serializers.
> > > > >
> > > > >
> > > > > Regarding the second question: Is there a downside registering all
> > > types
> > > > > for tagging? We reduce the overall I/O which is good for
> performance.
> > > > >
> > > > > Best,
> > > > > Robert
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Jan 19, 2015 at 8:24 PM, Stephan Ewen <[hidden email]>
> > > wrote:
> > > > >
> > > > > > Hi all!
> > > > > >
> > > > > > We have various pending pull requests that add support for
> certain
> > > > types
> > > > > by
> > > > > > adding extra kryo serializers.
> > > > > >
> > > > > > I think we need to decide how we want to handle the support for
> > extra
> > > > > > types, because more are certainly to come.
> > > > > >
> > > > > > As I understand it, we have three broad options:
> > > > > >
> > > > > > (1)
> > > > > > Add as many serializers to Kryo by default as possible.
> > > > > >  Pro:
> > > > > >     - Many types work out of the box
> > > > > >  Contra:
> > > > > >     - We may eventually overload the kryo registry with
> serializers
> > > > > >       that are not needed for most cases and suffer in
> performance
> > > > > >     - It is hard to guess which types work out of the box
> > > > (intransparent)
> > > > > >
> > > > > >
> > > > > > (2)
> > > > > > We create a collection of serializers and a registration util.
> > > > > > --------
> > > > > > val env = ExecutionEnvironemnt.getExecutionEnviroment()
> > > > > >
> > > > > > Serializers.registerProtoBufSerializers(env);
> > > > > > Serializers.registerJavaUtilSerializers(env);
> > > > > > ---------
> > > > > > Pro:
> > > > > >   - Easy for users
> > > > > >   - We can grow the set of supported types very large without
> > > > overloading
> > > > > > Kryo
> > > > > >   - It is transparent what gets registered
> > > > > >
> > > > > > Contra:
> > > > > >   - Not quite as convenient as if things just run
> > > > > >
> > > > > >
> > > > > > (3)
> > > > > > We do nothing and let the user create and register whatever is
> > > needed.
> > > > > >
> > > > > > We could have a library and utility for serializers for certain
> > > > > libraries.
> > > > > > Users could use this to conveniently add serializers for the
> > > libraries
> > > > > they
> > > > > > use.
> > > > > > Pro:
> > > > > >   - Simple for us ;-)
> > > > > > Contra:
> > > > > >   - More repeated work for users
> > > > > >
> > > > > >
> > > > > > ========================
> > > > > >
> > > > > > For approach (1) and (2), there is an orthogonal question of
> > whether
> > > we
> > > > > > want to simply register default serializers (that enable that
> types
> > > > work)
> > > > > > or also register types for tags, to speed up the serialization of
> > > those
> > > > > > types.
> > > > > >
> > > > > >
> > > > > > Greetings,
> > > > > > Stephan
> > > > > >
> > > > >
> > > >
> > >
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: Handling custom types with Kryo

Ufuk Celebi-2

On 20 Jan 2015, at 11:51, Alexander Alexandrov <[hidden email]> wrote:

> +1 for program analysis from me too...
>
> Should be doable also on a lower level (e.g. analysis of compiled *.class
> files) with some off-the-shelf libraries, right?

Yes. There was a prototypical implementation using byte code analysis to automatically annotate UDFs (constant fields etc.). As Robert pointed out though, the used library (called soot) is incompatible with the Apache license. :(

I think we have two options if we want to pursue this further:

1) Use soot with a custom build of Flink, so that we don't ship the code, or

2) Look into other (compatible) libraries for this
Reply | Threaded
Open this post in threaded view
|

Re: Handling custom types with Kryo

Timo Walther-2
In reply to this post by aalexandrov
Are we talking about the types for the input/output of operators or also
types that are used inside UDFs?
Operator I/O type classes are known, so we don't need static code
analysis for that. For types inside UDFs I can add that requirement to
FLINK-1319.


On 20.01.2015 11:51, Alexander Alexandrov wrote:

> +1 for program analysis from me too...
>
> Should be doable also on a lower level (e.g. analysis of compiled *.class
> files) with some off-the-shelf libraries, right?
>
> 2015-01-20 11:39 GMT+01:00 Till Rohrmann <[hidden email]>:
>
>> I like the idea to automatically figure out which types are used by a
>> program and to register them at Kryo. Thus, +1 for this idea.
>>
>> On Tue, Jan 20, 2015 at 11:34 AM, Robert Metzger <[hidden email]>
>> wrote:
>>
>>> @Stephan: Yes, you are summarizing it correctly.
>>> I'll assign FLINK-1417 to myself and implement it as discussed here
>> (once I
>>> have resolved the other issues assigned to me)
>>>
>>> There is one additional point we forgot in the discussion so far: We are
>>> initializing Kryo with twitter/chill's "ScalaKryoInstantiator()". I just
>>> checked and its registering 51 default serializers and 81 registered
>> types.
>>> I just talked to Till (who implemented the KryoSerializer originally) and
>>> he also suggests to pool kryo instances using thread-local variables.
>> I'll
>>> look into that as well once FLINK-1417 has been resolved. I think that
>>> helps to mitigate the heavy initialization of Kryo (which is inevitable)
>>>
>>>
>>> @Arvid: Our POJO/Types support does explicitly not require our users to
>>> implement any interfaces, so that option is not feasible.
>>> The bytecode analysis is not part of the main flink distribution because
>> it
>>> is using a library with an apache incompatible license.
>>>
>>>
>>>
>>> On Tue, Jan 20, 2015 at 9:58 AM, Arvid Heise <[hidden email]>
>>> wrote:
>>>
>>>> An alternative way that may not be applicable in your case:
>>>>
>>>> For Sopremo, all types implemented a common interface. When a package
>> is
>>>> loaded, the Sopremo package manager scans the jar and looks for classes
>>>> implementing the interfaces (quite fast, because not the entire class
>>> must
>>>> be loaded). All types implementing the interface are automatically
>> added
>>> to
>>>> Kryo with their respective annotated serializers.
>>>>
>>>> If you still have bytecode analysis of jobs, you can also statically
>>>> determine all types that are used, check for default serializers, and
>>>> maintain only a minimal set of serializers used for this specific job.
>>> You
>>>> could already warn for unregistered types before submitting jobs.
>>>>
>>>> On Tue, Jan 20, 2015 at 6:38 AM, Stephan Ewen <[hidden email]>
>> wrote:
>>>>> Yes, I agree that the Avro serializer should be available by default.
>>>> That
>>>>> is one case of a typical type that should work out of the box, given
>>> that
>>>>> we support Avro file formats.
>>>>>
>>>>> Let me summarize how I understood that suggestion:
>>>>>
>>>>>   - We make Avro available by default by registering a default
>>> serializer
>>>>> for the SpecificBase
>>>>>
>>>>>   - We create a library of serializers. We do not register them by
>>>> default.
>>>>>   - Via FLINK-1417, we analyze the types. For any (nested) type that
>> we
>>>>> encounter for which we have a serializer in the library, we register
>>> that
>>>>> serializer as the default serializer. Also, for every (nested) type
>> we
>>>>> encounter, we register a tag at Kryo.
>>>>>
>>>>> I like that, it should give a nice and smooth user experience.
>>>>>
>>>>> Greetings,
>>>>> Stephan
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Jan 19, 2015 at 12:32 PM, Robert Metzger <
>> [hidden email]>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> thank you for putting our discussion to the mailing list. This is
>>>> indeed
>>>>>> where such discussions belong. For the others, we started
>> discussing
>>>>> here:
>>>>>> https://github.com/apache/flink/pull/304
>>>>>>
>>>>>> I think there is one additional approach, which is probably close
>> to
>>>> (1):
>>>>>> We only register those serializers by default which we don't see in
>>> the
>>>>>> pre-flight phase (I think right now thats only GenericData.Array
>> from
>>>>>> Avro).
>>>>>> We would come across all the other classes (Jodatime, Protobuf,
>> Avro,
>>>>>> Thrift, ...) when traversing the class hierarchy, as proposed in
>>>>>> FLINK-1417. With this approach, users get the best out-of-the box
>>>>>> experience and the number of registered classes / serializers is
>> kept
>>>> at
>>>>> a
>>>>>> minimum.
>>>>>> We can still offer means to register additional serializers (I
>> think
>>>>> thats
>>>>>> already merged to master).
>>>>>>
>>>>>> My main concern with this particular issue is a good out of the box
>>>> user
>>>>>> experience. If there is an issue with type serialization, users
>> will
>>>>> notice
>>>>>> it very early. (In my experience people often have their existing
>>>>> datatypes
>>>>>> they use with other systems, and they want to continue using them)
>>>>>> Therefore, I want to put some effort into making it as good as
>>>> possible.
>>>>> I
>>>>>> would actually sacrifice performance over stability/usability here.
>>> Our
>>>>>> system is flexible enough to replace it later with a more efficient
>>>>>> serialization if that becomes an issue. But maybe my suggestion
>> above
>>>> is
>>>>>> already sufficient.
>>>>>>
>>>>>> We could also think about introducing a configuration variable
>> which
>>>>> allows
>>>>>> users to disable the default serializers.
>>>>>>
>>>>>>
>>>>>> Regarding the second question: Is there a downside registering all
>>>> types
>>>>>> for tagging? We reduce the overall I/O which is good for
>> performance.
>>>>>> Best,
>>>>>> Robert
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Jan 19, 2015 at 8:24 PM, Stephan Ewen <[hidden email]>
>>>> wrote:
>>>>>>> Hi all!
>>>>>>>
>>>>>>> We have various pending pull requests that add support for
>> certain
>>>>> types
>>>>>> by
>>>>>>> adding extra kryo serializers.
>>>>>>>
>>>>>>> I think we need to decide how we want to handle the support for
>>> extra
>>>>>>> types, because more are certainly to come.
>>>>>>>
>>>>>>> As I understand it, we have three broad options:
>>>>>>>
>>>>>>> (1)
>>>>>>> Add as many serializers to Kryo by default as possible.
>>>>>>>   Pro:
>>>>>>>      - Many types work out of the box
>>>>>>>   Contra:
>>>>>>>      - We may eventually overload the kryo registry with
>> serializers
>>>>>>>        that are not needed for most cases and suffer in
>> performance
>>>>>>>      - It is hard to guess which types work out of the box
>>>>> (intransparent)
>>>>>>>
>>>>>>> (2)
>>>>>>> We create a collection of serializers and a registration util.
>>>>>>> --------
>>>>>>> val env = ExecutionEnvironemnt.getExecutionEnviroment()
>>>>>>>
>>>>>>> Serializers.registerProtoBufSerializers(env);
>>>>>>> Serializers.registerJavaUtilSerializers(env);
>>>>>>> ---------
>>>>>>> Pro:
>>>>>>>    - Easy for users
>>>>>>>    - We can grow the set of supported types very large without
>>>>> overloading
>>>>>>> Kryo
>>>>>>>    - It is transparent what gets registered
>>>>>>>
>>>>>>> Contra:
>>>>>>>    - Not quite as convenient as if things just run
>>>>>>>
>>>>>>>
>>>>>>> (3)
>>>>>>> We do nothing and let the user create and register whatever is
>>>> needed.
>>>>>>> We could have a library and utility for serializers for certain
>>>>>> libraries.
>>>>>>> Users could use this to conveniently add serializers for the
>>>> libraries
>>>>>> they
>>>>>>> use.
>>>>>>> Pro:
>>>>>>>    - Simple for us ;-)
>>>>>>> Contra:
>>>>>>>    - More repeated work for users
>>>>>>>
>>>>>>>
>>>>>>> ========================
>>>>>>>
>>>>>>> For approach (1) and (2), there is an orthogonal question of
>>> whether
>>>> we
>>>>>>> want to simply register default serializers (that enable that
>> types
>>>>> work)
>>>>>>> or also register types for tags, to speed up the serialization of
>>>> those
>>>>>>> types.
>>>>>>>
>>>>>>>
>>>>>>> Greetings,
>>>>>>> Stephan
>>>>>>>

Reply | Threaded
Open this post in threaded view
|

Re: Handling custom types with Kryo

aalexandrov
I think we are talking about the Operator I/O types, as types used
internally only by the UDFs should not be serialized.

2015-01-20 12:27 GMT+01:00 Timo Walther <[hidden email]>:

> Are we talking about the types for the input/output of operators or also
> types that are used inside UDFs?
> Operator I/O type classes are known, so we don't need static code analysis
> for that. For types inside UDFs I can add that requirement to FLINK-1319.
>
>
>
> On 20.01.2015 11:51, Alexander Alexandrov wrote:
>
>> +1 for program analysis from me too...
>>
>> Should be doable also on a lower level (e.g. analysis of compiled *.class
>> files) with some off-the-shelf libraries, right?
>>
>> 2015-01-20 11:39 GMT+01:00 Till Rohrmann <[hidden email]>:
>>
>>  I like the idea to automatically figure out which types are used by a
>>> program and to register them at Kryo. Thus, +1 for this idea.
>>>
>>> On Tue, Jan 20, 2015 at 11:34 AM, Robert Metzger <[hidden email]>
>>> wrote:
>>>
>>>  @Stephan: Yes, you are summarizing it correctly.
>>>> I'll assign FLINK-1417 to myself and implement it as discussed here
>>>>
>>> (once I
>>>
>>>> have resolved the other issues assigned to me)
>>>>
>>>> There is one additional point we forgot in the discussion so far: We are
>>>> initializing Kryo with twitter/chill's "ScalaKryoInstantiator()". I just
>>>> checked and its registering 51 default serializers and 81 registered
>>>>
>>> types.
>>>
>>>> I just talked to Till (who implemented the KryoSerializer originally)
>>>> and
>>>> he also suggests to pool kryo instances using thread-local variables.
>>>>
>>> I'll
>>>
>>>> look into that as well once FLINK-1417 has been resolved. I think that
>>>> helps to mitigate the heavy initialization of Kryo (which is inevitable)
>>>>
>>>>
>>>> @Arvid: Our POJO/Types support does explicitly not require our users to
>>>> implement any interfaces, so that option is not feasible.
>>>> The bytecode analysis is not part of the main flink distribution because
>>>>
>>> it
>>>
>>>> is using a library with an apache incompatible license.
>>>>
>>>>
>>>>
>>>> On Tue, Jan 20, 2015 at 9:58 AM, Arvid Heise <[hidden email]>
>>>> wrote:
>>>>
>>>>  An alternative way that may not be applicable in your case:
>>>>>
>>>>> For Sopremo, all types implemented a common interface. When a package
>>>>>
>>>> is
>>>
>>>> loaded, the Sopremo package manager scans the jar and looks for classes
>>>>> implementing the interfaces (quite fast, because not the entire class
>>>>>
>>>> must
>>>>
>>>>> be loaded). All types implementing the interface are automatically
>>>>>
>>>> added
>>>
>>>> to
>>>>
>>>>> Kryo with their respective annotated serializers.
>>>>>
>>>>> If you still have bytecode analysis of jobs, you can also statically
>>>>> determine all types that are used, check for default serializers, and
>>>>> maintain only a minimal set of serializers used for this specific job.
>>>>>
>>>> You
>>>>
>>>>> could already warn for unregistered types before submitting jobs.
>>>>>
>>>>> On Tue, Jan 20, 2015 at 6:38 AM, Stephan Ewen <[hidden email]>
>>>>>
>>>> wrote:
>>>
>>>> Yes, I agree that the Avro serializer should be available by default.
>>>>>>
>>>>> That
>>>>>
>>>>>> is one case of a typical type that should work out of the box, given
>>>>>>
>>>>> that
>>>>
>>>>> we support Avro file formats.
>>>>>>
>>>>>> Let me summarize how I understood that suggestion:
>>>>>>
>>>>>>   - We make Avro available by default by registering a default
>>>>>>
>>>>> serializer
>>>>
>>>>> for the SpecificBase
>>>>>>
>>>>>>   - We create a library of serializers. We do not register them by
>>>>>>
>>>>> default.
>>>>>
>>>>>>   - Via FLINK-1417, we analyze the types. For any (nested) type that
>>>>>>
>>>>> we
>>>
>>>> encounter for which we have a serializer in the library, we register
>>>>>>
>>>>> that
>>>>
>>>>> serializer as the default serializer. Also, for every (nested) type
>>>>>>
>>>>> we
>>>
>>>> encounter, we register a tag at Kryo.
>>>>>>
>>>>>> I like that, it should give a nice and smooth user experience.
>>>>>>
>>>>>> Greetings,
>>>>>> Stephan
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Jan 19, 2015 at 12:32 PM, Robert Metzger <
>>>>>>
>>>>> [hidden email]>
>>>
>>>> wrote:
>>>>>>
>>>>>>  Hi,
>>>>>>>
>>>>>>> thank you for putting our discussion to the mailing list. This is
>>>>>>>
>>>>>> indeed
>>>>>
>>>>>> where such discussions belong. For the others, we started
>>>>>>>
>>>>>> discussing
>>>
>>>> here:
>>>>>>
>>>>>>> https://github.com/apache/flink/pull/304
>>>>>>>
>>>>>>> I think there is one additional approach, which is probably close
>>>>>>>
>>>>>> to
>>>
>>>> (1):
>>>>>
>>>>>> We only register those serializers by default which we don't see in
>>>>>>>
>>>>>> the
>>>>
>>>>> pre-flight phase (I think right now thats only GenericData.Array
>>>>>>>
>>>>>> from
>>>
>>>> Avro).
>>>>>>> We would come across all the other classes (Jodatime, Protobuf,
>>>>>>>
>>>>>> Avro,
>>>
>>>> Thrift, ...) when traversing the class hierarchy, as proposed in
>>>>>>> FLINK-1417. With this approach, users get the best out-of-the box
>>>>>>> experience and the number of registered classes / serializers is
>>>>>>>
>>>>>> kept
>>>
>>>> at
>>>>>
>>>>>> a
>>>>>>
>>>>>>> minimum.
>>>>>>> We can still offer means to register additional serializers (I
>>>>>>>
>>>>>> think
>>>
>>>> thats
>>>>>>
>>>>>>> already merged to master).
>>>>>>>
>>>>>>> My main concern with this particular issue is a good out of the box
>>>>>>>
>>>>>> user
>>>>>
>>>>>> experience. If there is an issue with type serialization, users
>>>>>>>
>>>>>> will
>>>
>>>> notice
>>>>>>
>>>>>>> it very early. (In my experience people often have their existing
>>>>>>>
>>>>>> datatypes
>>>>>>
>>>>>>> they use with other systems, and they want to continue using them)
>>>>>>> Therefore, I want to put some effort into making it as good as
>>>>>>>
>>>>>> possible.
>>>>>
>>>>>> I
>>>>>>
>>>>>>> would actually sacrifice performance over stability/usability here.
>>>>>>>
>>>>>> Our
>>>>
>>>>> system is flexible enough to replace it later with a more efficient
>>>>>>> serialization if that becomes an issue. But maybe my suggestion
>>>>>>>
>>>>>> above
>>>
>>>> is
>>>>>
>>>>>> already sufficient.
>>>>>>>
>>>>>>> We could also think about introducing a configuration variable
>>>>>>>
>>>>>> which
>>>
>>>> allows
>>>>>>
>>>>>>> users to disable the default serializers.
>>>>>>>
>>>>>>>
>>>>>>> Regarding the second question: Is there a downside registering all
>>>>>>>
>>>>>> types
>>>>>
>>>>>> for tagging? We reduce the overall I/O which is good for
>>>>>>>
>>>>>> performance.
>>>
>>>> Best,
>>>>>>> Robert
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jan 19, 2015 at 8:24 PM, Stephan Ewen <[hidden email]>
>>>>>>>
>>>>>> wrote:
>>>>>
>>>>>> Hi all!
>>>>>>>>
>>>>>>>> We have various pending pull requests that add support for
>>>>>>>>
>>>>>>> certain
>>>
>>>> types
>>>>>>
>>>>>>> by
>>>>>>>
>>>>>>>> adding extra kryo serializers.
>>>>>>>>
>>>>>>>> I think we need to decide how we want to handle the support for
>>>>>>>>
>>>>>>> extra
>>>>
>>>>> types, because more are certainly to come.
>>>>>>>>
>>>>>>>> As I understand it, we have three broad options:
>>>>>>>>
>>>>>>>> (1)
>>>>>>>> Add as many serializers to Kryo by default as possible.
>>>>>>>>   Pro:
>>>>>>>>      - Many types work out of the box
>>>>>>>>   Contra:
>>>>>>>>      - We may eventually overload the kryo registry with
>>>>>>>>
>>>>>>> serializers
>>>
>>>>        that are not needed for most cases and suffer in
>>>>>>>>
>>>>>>> performance
>>>
>>>>      - It is hard to guess which types work out of the box
>>>>>>>>
>>>>>>> (intransparent)
>>>>>>
>>>>>>>
>>>>>>>> (2)
>>>>>>>> We create a collection of serializers and a registration util.
>>>>>>>> --------
>>>>>>>> val env = ExecutionEnvironemnt.getExecutionEnviroment()
>>>>>>>>
>>>>>>>> Serializers.registerProtoBufSerializers(env);
>>>>>>>> Serializers.registerJavaUtilSerializers(env);
>>>>>>>> ---------
>>>>>>>> Pro:
>>>>>>>>    - Easy for users
>>>>>>>>    - We can grow the set of supported types very large without
>>>>>>>>
>>>>>>> overloading
>>>>>>
>>>>>>> Kryo
>>>>>>>>    - It is transparent what gets registered
>>>>>>>>
>>>>>>>> Contra:
>>>>>>>>    - Not quite as convenient as if things just run
>>>>>>>>
>>>>>>>>
>>>>>>>> (3)
>>>>>>>> We do nothing and let the user create and register whatever is
>>>>>>>>
>>>>>>> needed.
>>>>>
>>>>>> We could have a library and utility for serializers for certain
>>>>>>>>
>>>>>>> libraries.
>>>>>>>
>>>>>>>> Users could use this to conveniently add serializers for the
>>>>>>>>
>>>>>>> libraries
>>>>>
>>>>>> they
>>>>>>>
>>>>>>>> use.
>>>>>>>> Pro:
>>>>>>>>    - Simple for us ;-)
>>>>>>>> Contra:
>>>>>>>>    - More repeated work for users
>>>>>>>>
>>>>>>>>
>>>>>>>> ========================
>>>>>>>>
>>>>>>>> For approach (1) and (2), there is an orthogonal question of
>>>>>>>>
>>>>>>> whether
>>>>
>>>>> we
>>>>>
>>>>>> want to simply register default serializers (that enable that
>>>>>>>>
>>>>>>> types
>>>
>>>> work)
>>>>>>
>>>>>>> or also register types for tags, to speed up the serialization of
>>>>>>>>
>>>>>>> those
>>>>>
>>>>>> types.
>>>>>>>>
>>>>>>>>
>>>>>>>> Greetings,
>>>>>>>> Stephan
>>>>>>>>
>>>>>>>>
>
Reply | Threaded
Open this post in threaded view
|

Re: Handling custom types with Kryo

Aljoscha Krettek-2
In reply to this post by Stephan Ewen
Yes, that sounds very reasonable.
On Jan 20, 2015 6:40 AM, "Stephan Ewen" <[hidden email]> wrote:

> Yes, I agree that the Avro serializer should be available by default. That
> is one case of a typical type that should work out of the box, given that
> we support Avro file formats.
>
> Let me summarize how I understood that suggestion:
>
>  - We make Avro available by default by registering a default serializer
> for the SpecificBase
>
>  - We create a library of serializers. We do not register them by default.
>
>  - Via FLINK-1417, we analyze the types. For any (nested) type that we
> encounter for which we have a serializer in the library, we register that
> serializer as the default serializer. Also, for every (nested) type we
> encounter, we register a tag at Kryo.
>
> I like that, it should give a nice and smooth user experience.
>
> Greetings,
> Stephan
>
>
>
>
> On Mon, Jan 19, 2015 at 12:32 PM, Robert Metzger <[hidden email]>
> wrote:
>
> > Hi,
> >
> > thank you for putting our discussion to the mailing list. This is indeed
> > where such discussions belong. For the others, we started discussing
> here:
> > https://github.com/apache/flink/pull/304
> >
> > I think there is one additional approach, which is probably close to (1):
> > We only register those serializers by default which we don't see in the
> > pre-flight phase (I think right now thats only GenericData.Array from
> > Avro).
> > We would come across all the other classes (Jodatime, Protobuf, Avro,
> > Thrift, ...) when traversing the class hierarchy, as proposed in
> > FLINK-1417. With this approach, users get the best out-of-the box
> > experience and the number of registered classes / serializers is kept at
> a
> > minimum.
> > We can still offer means to register additional serializers (I think
> thats
> > already merged to master).
> >
> > My main concern with this particular issue is a good out of the box user
> > experience. If there is an issue with type serialization, users will
> notice
> > it very early. (In my experience people often have their existing
> datatypes
> > they use with other systems, and they want to continue using them)
> > Therefore, I want to put some effort into making it as good as possible.
> I
> > would actually sacrifice performance over stability/usability here. Our
> > system is flexible enough to replace it later with a more efficient
> > serialization if that becomes an issue. But maybe my suggestion above is
> > already sufficient.
> >
> > We could also think about introducing a configuration variable which
> allows
> > users to disable the default serializers.
> >
> >
> > Regarding the second question: Is there a downside registering all types
> > for tagging? We reduce the overall I/O which is good for performance.
> >
> > Best,
> > Robert
> >
> >
> >
> > On Mon, Jan 19, 2015 at 8:24 PM, Stephan Ewen <[hidden email]> wrote:
> >
> > > Hi all!
> > >
> > > We have various pending pull requests that add support for certain
> types
> > by
> > > adding extra kryo serializers.
> > >
> > > I think we need to decide how we want to handle the support for extra
> > > types, because more are certainly to come.
> > >
> > > As I understand it, we have three broad options:
> > >
> > > (1)
> > > Add as many serializers to Kryo by default as possible.
> > >  Pro:
> > >     - Many types work out of the box
> > >  Contra:
> > >     - We may eventually overload the kryo registry with serializers
> > >       that are not needed for most cases and suffer in performance
> > >     - It is hard to guess which types work out of the box
> (intransparent)
> > >
> > >
> > > (2)
> > > We create a collection of serializers and a registration util.
> > > --------
> > > val env = ExecutionEnvironemnt.getExecutionEnviroment()
> > >
> > > Serializers.registerProtoBufSerializers(env);
> > > Serializers.registerJavaUtilSerializers(env);
> > > ---------
> > > Pro:
> > >   - Easy for users
> > >   - We can grow the set of supported types very large without
> overloading
> > > Kryo
> > >   - It is transparent what gets registered
> > >
> > > Contra:
> > >   - Not quite as convenient as if things just run
> > >
> > >
> > > (3)
> > > We do nothing and let the user create and register whatever is needed.
> > >
> > > We could have a library and utility for serializers for certain
> > libraries.
> > > Users could use this to conveniently add serializers for the libraries
> > they
> > > use.
> > > Pro:
> > >   - Simple for us ;-)
> > > Contra:
> > >   - More repeated work for users
> > >
> > >
> > > ========================
> > >
> > > For approach (1) and (2), there is an orthogonal question of whether we
> > > want to simply register default serializers (that enable that types
> work)
> > > or also register types for tags, to speed up the serialization of those
> > > types.
> > >
> > >
> > > Greetings,
> > > Stephan
> > >
> >
>