Questions concerning Akka / Documentation

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Questions concerning Akka / Documentation

Stephan Ewen
Hi!

Since the new distributed infrastructure is built on Akka, some internal
concepts have changed now.
I think that this is currently not really document anywhere

@Till Can you elaborate on the questions here:

 - What is the Akka URL in the global configuration ("jobmanager.akka.url")
From the perspective of the global configuration, don't we simply have the
address and port of the actor system?

 - We currently have multiple competing failure-detection mechanisms: For
one, the job manager actor watches the task manager actors. Also, we still
have the manual heart beats in place. Shouldn't we remove the old manual
heartbeats and have the instance manager watch the task manager actors?

 - There are transport heartbeats and watch heartbeats. I could not find a
good explanation of what the transport heartbeats are. Also, the heartbeat
interval is very large (1000 s) by default, so I am wondering what there
purpose is.

 - There are many different timeouts:
   -> startup timeout
   -> watch heartbeat timeout
   -> ask timeout
   -> TCP timeout
  How to the relate / interact? Does it make sense to define them relative
to one another?

I think it makes a lot of sense to document these points somewhere.

Greetings,
Stephan
Reply | Threaded
Open this post in threaded view
|

Re: Questions concerning Akka / Documentation

Till Rohrmann
Hi,

you are right, the new implementation still lacks a lot of documentation
which makes understanding the code harder than necessary.

On Sun, Jan 4, 2015 at 10:28 PM, Stephan Ewen <[hidden email]> wrote:

> Hi!
>
> Since the new distributed infrastructure is built on Akka, some internal
> concepts have changed now.
> I think that this is currently not really document anywhere
>
> @Till Can you elaborate on the questions here:
>
>  - What is the Akka URL in the global configuration ("jobmanager.akka.url")
> From the perspective of the global configuration, don't we simply have the
> address and port of the actor system?
>

The jobmanager.akka.url is used to overwrite the default akka url
generation which is akka.tcp://${HOSTNAME}:${PORT}. This is necessary in
cases where we do not have remote actor systems but a single local, as in
the case of local execution, and thus have to use a different url scheme.
In case of a single actor system, the url would be
akka://${ACTORSYSTEMNAME}. So in fact this configuration option is only
used internally and should not be configured by the user. To make it
fail-safe we should probably use a non exposed mechanism.


>
>  - We currently have multiple competing failure-detection mechanisms: For
> one, the job manager actor watches the task manager actors. Also, we still
> have the manual heart beats in place. Shouldn't we remove the old manual
> heartbeats and have the instance manager watch the task manager actors?
>

It's right that we still have the old heartbeats in place but they are
stripped down. Currently, they are only used to update the
lastReceivedHeartBeat field in the Instance object. Consequently, they
could be simply removed at the price of not getting shown the time since
the last heartbeat in the web interface. The failure detection mechanism is
currently realized exclusively by using Akka's death watch, meaning that
the JobManager watches the TaskManagers and vice versa. I also thought that
some people wanted to piggy back on the heartbeat message to do monitoring.
Therefore I kept it for the moment. But I guess that a dedicated monitoring
message would be better.


>  - There are transport heartbeats and watch heartbeats. I could not find a
> good explanation of what the transport heartbeats are. Also, the heartbeat
> interval is very large (1000 s) by default, so I am wondering what there
> purpose is.
>

Yes you're right that Akka has a lot of little knobs to turn and twist and
some of them are more obvious than others. The transport failure detector
is Akka's own mechanism to detect lost messages. This is necessary for UDP
but not for TCP since it has its own failure detector. In order to decrease
the unnecessary network traffic, I set the heartbeat pause and heartbeat
interval of the transport failure detector to these high numbers.


>
>  - There are many different timeouts:
>    -> startup timeout
>

That is the timeout for creating an actor system.


>    -> watch heartbeat timeout
>

This timeout is used for the death watch. But the detector is actually
controlled by akka.watch.heartbeat.interval, akka.watch.heartbeat.pause and
akka.watch.threshold. In [1] it is described what these parameters do.


>    -> ask timeout
>

That is the general timeout which is used for all futures once the actor
system has been started.


>    -> TCP timeout
>

The TCP timeout is the timeout which is used by Netty for all outbound
connections.


>   How to the relate / interact? Does it make sense to define them relative
> to one another?


For the sake of simplicity and usability, it is a good idea to derive the
individual timeouts by means of some heuristics from a single timeout
value. Maybe we could use these heuristics as default values but still
allow the user to define these values himself if he wants to.


>
> I think it makes a lot of sense to document these points somewhere.
>

I'll add an overview and details of the implementation to the internals
section of the documentation.


>
> Greetings,
> Stephan
>

[1]
http://doc.akka.io/docs/akka/snapshot/scala/remoting.html#watching-remote-actors
Reply | Threaded
Open this post in threaded view
|

Re: Questions concerning Akka / Documentation

Stephan Ewen
Hi!

Thanks for clarifying. Here are some thoughts:

1) The akka URL should go through a non-exposes mechanism, true. In fact,
using the global configuration for the local embedded mode at all seems to
be a bad design that we should get rid of.

2) Okay, so we keep our own hearbeats in place as a means for metric
reports. At some point, we can avoid having the JobManager actor watch the
TaskManager actor then, it seems.

3) re: transport failure detector - makes sense

4) Yes, let's have a single timeout value that defines the ask timeout, tcp
timeout, and the interval of the watch failure detector, and allow to
override them by specifying the options.

Stephan


On Mon, Jan 5, 2015 at 10:30 AM, Till Rohrmann <[hidden email]> wrote:

> Hi,
>
> you are right, the new implementation still lacks a lot of documentation
> which makes understanding the code harder than necessary.
>
> On Sun, Jan 4, 2015 at 10:28 PM, Stephan Ewen <[hidden email]> wrote:
>
> > Hi!
> >
> > Since the new distributed infrastructure is built on Akka, some internal
> > concepts have changed now.
> > I think that this is currently not really document anywhere
> >
> > @Till Can you elaborate on the questions here:
> >
> >  - What is the Akka URL in the global configuration
> ("jobmanager.akka.url")
> > From the perspective of the global configuration, don't we simply have
> the
> > address and port of the actor system?
> >
>
> The jobmanager.akka.url is used to overwrite the default akka url
> generation which is akka.tcp://${HOSTNAME}:${PORT}. This is necessary in
> cases where we do not have remote actor systems but a single local, as in
> the case of local execution, and thus have to use a different url scheme.
> In case of a single actor system, the url would be
> akka://${ACTORSYSTEMNAME}. So in fact this configuration option is only
> used internally and should not be configured by the user. To make it
> fail-safe we should probably use a non exposed mechanism.
>
>
> >
> >  - We currently have multiple competing failure-detection mechanisms: For
> > one, the job manager actor watches the task manager actors. Also, we
> still
> > have the manual heart beats in place. Shouldn't we remove the old manual
> > heartbeats and have the instance manager watch the task manager actors?
> >
>
> It's right that we still have the old heartbeats in place but they are
> stripped down. Currently, they are only used to update the
> lastReceivedHeartBeat field in the Instance object. Consequently, they
> could be simply removed at the price of not getting shown the time since
> the last heartbeat in the web interface. The failure detection mechanism is
> currently realized exclusively by using Akka's death watch, meaning that
> the JobManager watches the TaskManagers and vice versa. I also thought that
> some people wanted to piggy back on the heartbeat message to do monitoring.
> Therefore I kept it for the moment. But I guess that a dedicated monitoring
> message would be better.
>
>
> >  - There are transport heartbeats and watch heartbeats. I could not find
> a
> > good explanation of what the transport heartbeats are. Also, the
> heartbeat
> > interval is very large (1000 s) by default, so I am wondering what there
> > purpose is.
> >
>
> Yes you're right that Akka has a lot of little knobs to turn and twist and
> some of them are more obvious than others. The transport failure detector
> is Akka's own mechanism to detect lost messages. This is necessary for UDP
> but not for TCP since it has its own failure detector. In order to decrease
> the unnecessary network traffic, I set the heartbeat pause and heartbeat
> interval of the transport failure detector to these high numbers.
>
>
> >
> >  - There are many different timeouts:
> >    -> startup timeout
> >
>
> That is the timeout for creating an actor system.
>
>
> >    -> watch heartbeat timeout
> >
>
> This timeout is used for the death watch. But the detector is actually
> controlled by akka.watch.heartbeat.interval, akka.watch.heartbeat.pause and
> akka.watch.threshold. In [1] it is described what these parameters do.
>
>
> >    -> ask timeout
> >
>
> That is the general timeout which is used for all futures once the actor
> system has been started.
>
>
> >    -> TCP timeout
> >
>
> The TCP timeout is the timeout which is used by Netty for all outbound
> connections.
>
>
> >   How to the relate / interact? Does it make sense to define them
> relative
> > to one another?
>
>
> For the sake of simplicity and usability, it is a good idea to derive the
> individual timeouts by means of some heuristics from a single timeout
> value. Maybe we could use these heuristics as default values but still
> allow the user to define these values himself if he wants to.
>
>
> >
> > I think it makes a lot of sense to document these points somewhere.
> >
>
> I'll add an overview and details of the implementation to the internals
> section of the documentation.
>
>
> >
> > Greetings,
> > Stephan
> >
>
> [1]
>
> http://doc.akka.io/docs/akka/snapshot/scala/remoting.html#watching-remote-actors
>
Reply | Threaded
Open this post in threaded view
|

Re: Questions concerning Akka / Documentation

Till Rohrmann
I addressed the issues.

On Mon, Jan 5, 2015 at 3:16 PM, Stephan Ewen <[hidden email]> wrote:

> Hi!
>
> Thanks for clarifying. Here are some thoughts:
>
> 1) The akka URL should go through a non-exposes mechanism, true. In fact,
> using the global configuration for the local embedded mode at all seems to
> be a bad design that we should get rid of.


Sooner or later we should also get rid of the global configuration.


>
> 2) Okay, so we keep our own hearbeats in place as a means for metric
> reports. At some point, we can avoid having the JobManager actor watch the
> TaskManager actor then, it seems.
>

That is right. We can choose depending which one performs better.


>
> 3) re: transport failure detector - makes sense
>
> 4) Yes, let's have a single timeout value that defines the ask timeout, tcp
> timeout, and the interval of the watch failure detector, and allow to
> override them by specifying the options.
>

I chose the following heuristics for the moment:

ask timeout = tcp timeout = startup timeout = death watch pause = 10 *
interval of death watch

We can see how they behave and if necessary adapt.


>
> Stephan
>
>
> On Mon, Jan 5, 2015 at 10:30 AM, Till Rohrmann <[hidden email]>
> wrote:
>
> > Hi,
> >
> > you are right, the new implementation still lacks a lot of documentation
> > which makes understanding the code harder than necessary.
> >
> > On Sun, Jan 4, 2015 at 10:28 PM, Stephan Ewen <[hidden email]> wrote:
> >
> > > Hi!
> > >
> > > Since the new distributed infrastructure is built on Akka, some
> internal
> > > concepts have changed now.
> > > I think that this is currently not really document anywhere
> > >
> > > @Till Can you elaborate on the questions here:
> > >
> > >  - What is the Akka URL in the global configuration
> > ("jobmanager.akka.url")
> > > From the perspective of the global configuration, don't we simply have
> > the
> > > address and port of the actor system?
> > >
> >
> > The jobmanager.akka.url is used to overwrite the default akka url
> > generation which is akka.tcp://${HOSTNAME}:${PORT}. This is necessary in
> > cases where we do not have remote actor systems but a single local, as in
> > the case of local execution, and thus have to use a different url scheme.
> > In case of a single actor system, the url would be
> > akka://${ACTORSYSTEMNAME}. So in fact this configuration option is only
> > used internally and should not be configured by the user. To make it
> > fail-safe we should probably use a non exposed mechanism.
> >
> >
> > >
> > >  - We currently have multiple competing failure-detection mechanisms:
> For
> > > one, the job manager actor watches the task manager actors. Also, we
> > still
> > > have the manual heart beats in place. Shouldn't we remove the old
> manual
> > > heartbeats and have the instance manager watch the task manager actors?
> > >
> >
> > It's right that we still have the old heartbeats in place but they are
> > stripped down. Currently, they are only used to update the
> > lastReceivedHeartBeat field in the Instance object. Consequently, they
> > could be simply removed at the price of not getting shown the time since
> > the last heartbeat in the web interface. The failure detection mechanism
> is
> > currently realized exclusively by using Akka's death watch, meaning that
> > the JobManager watches the TaskManagers and vice versa. I also thought
> that
> > some people wanted to piggy back on the heartbeat message to do
> monitoring.
> > Therefore I kept it for the moment. But I guess that a dedicated
> monitoring
> > message would be better.
> >
> >
> > >  - There are transport heartbeats and watch heartbeats. I could not
> find
> > a
> > > good explanation of what the transport heartbeats are. Also, the
> > heartbeat
> > > interval is very large (1000 s) by default, so I am wondering what
> there
> > > purpose is.
> > >
> >
> > Yes you're right that Akka has a lot of little knobs to turn and twist
> and
> > some of them are more obvious than others. The transport failure detector
> > is Akka's own mechanism to detect lost messages. This is necessary for
> UDP
> > but not for TCP since it has its own failure detector. In order to
> decrease
> > the unnecessary network traffic, I set the heartbeat pause and heartbeat
> > interval of the transport failure detector to these high numbers.
> >
> >
> > >
> > >  - There are many different timeouts:
> > >    -> startup timeout
> > >
> >
> > That is the timeout for creating an actor system.
> >
> >
> > >    -> watch heartbeat timeout
> > >
> >
> > This timeout is used for the death watch. But the detector is actually
> > controlled by akka.watch.heartbeat.interval, akka.watch.heartbeat.pause
> and
> > akka.watch.threshold. In [1] it is described what these parameters do.
> >
> >
> > >    -> ask timeout
> > >
> >
> > That is the general timeout which is used for all futures once the actor
> > system has been started.
> >
> >
> > >    -> TCP timeout
> > >
> >
> > The TCP timeout is the timeout which is used by Netty for all outbound
> > connections.
> >
> >
> > >   How to the relate / interact? Does it make sense to define them
> > relative
> > > to one another?
> >
> >
> > For the sake of simplicity and usability, it is a good idea to derive the
> > individual timeouts by means of some heuristics from a single timeout
> > value. Maybe we could use these heuristics as default values but still
> > allow the user to define these values himself if he wants to.
> >
> >
> > >
> > > I think it makes a lot of sense to document these points somewhere.
> > >
> >
> > I'll add an overview and details of the implementation to the internals
> > section of the documentation.
> >
> >
> > >
> > > Greetings,
> > > Stephan
> > >
> >
> > [1]
> >
> >
> http://doc.akka.io/docs/akka/snapshot/scala/remoting.html#watching-remote-actors
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: Questions concerning Akka / Documentation

Stephan Ewen
Sounds good!

On Mon, Jan 5, 2015 at 7:14 PM, Till Rohrmann <[hidden email]> wrote:

> I addressed the issues.
>
> On Mon, Jan 5, 2015 at 3:16 PM, Stephan Ewen <[hidden email]> wrote:
>
> > Hi!
> >
> > Thanks for clarifying. Here are some thoughts:
> >
> > 1) The akka URL should go through a non-exposes mechanism, true. In fact,
> > using the global configuration for the local embedded mode at all seems
> to
> > be a bad design that we should get rid of.
>
>
> Sooner or later we should also get rid of the global configuration.
>
>
> >
> > 2) Okay, so we keep our own hearbeats in place as a means for metric
> > reports. At some point, we can avoid having the JobManager actor watch
> the
> > TaskManager actor then, it seems.
> >
>
> That is right. We can choose depending which one performs better.
>
>
> >
> > 3) re: transport failure detector - makes sense
> >
> > 4) Yes, let's have a single timeout value that defines the ask timeout,
> tcp
> > timeout, and the interval of the watch failure detector, and allow to
> > override them by specifying the options.
> >
>
> I chose the following heuristics for the moment:
>
> ask timeout = tcp timeout = startup timeout = death watch pause = 10 *
> interval of death watch
>
> We can see how they behave and if necessary adapt.
>
>
> >
> > Stephan
> >
> >
> > On Mon, Jan 5, 2015 at 10:30 AM, Till Rohrmann <[hidden email]>
> > wrote:
> >
> > > Hi,
> > >
> > > you are right, the new implementation still lacks a lot of
> documentation
> > > which makes understanding the code harder than necessary.
> > >
> > > On Sun, Jan 4, 2015 at 10:28 PM, Stephan Ewen <[hidden email]>
> wrote:
> > >
> > > > Hi!
> > > >
> > > > Since the new distributed infrastructure is built on Akka, some
> > internal
> > > > concepts have changed now.
> > > > I think that this is currently not really document anywhere
> > > >
> > > > @Till Can you elaborate on the questions here:
> > > >
> > > >  - What is the Akka URL in the global configuration
> > > ("jobmanager.akka.url")
> > > > From the perspective of the global configuration, don't we simply
> have
> > > the
> > > > address and port of the actor system?
> > > >
> > >
> > > The jobmanager.akka.url is used to overwrite the default akka url
> > > generation which is akka.tcp://${HOSTNAME}:${PORT}. This is necessary
> in
> > > cases where we do not have remote actor systems but a single local, as
> in
> > > the case of local execution, and thus have to use a different url
> scheme.
> > > In case of a single actor system, the url would be
> > > akka://${ACTORSYSTEMNAME}. So in fact this configuration option is only
> > > used internally and should not be configured by the user. To make it
> > > fail-safe we should probably use a non exposed mechanism.
> > >
> > >
> > > >
> > > >  - We currently have multiple competing failure-detection mechanisms:
> > For
> > > > one, the job manager actor watches the task manager actors. Also, we
> > > still
> > > > have the manual heart beats in place. Shouldn't we remove the old
> > manual
> > > > heartbeats and have the instance manager watch the task manager
> actors?
> > > >
> > >
> > > It's right that we still have the old heartbeats in place but they are
> > > stripped down. Currently, they are only used to update the
> > > lastReceivedHeartBeat field in the Instance object. Consequently, they
> > > could be simply removed at the price of not getting shown the time
> since
> > > the last heartbeat in the web interface. The failure detection
> mechanism
> > is
> > > currently realized exclusively by using Akka's death watch, meaning
> that
> > > the JobManager watches the TaskManagers and vice versa. I also thought
> > that
> > > some people wanted to piggy back on the heartbeat message to do
> > monitoring.
> > > Therefore I kept it for the moment. But I guess that a dedicated
> > monitoring
> > > message would be better.
> > >
> > >
> > > >  - There are transport heartbeats and watch heartbeats. I could not
> > find
> > > a
> > > > good explanation of what the transport heartbeats are. Also, the
> > > heartbeat
> > > > interval is very large (1000 s) by default, so I am wondering what
> > there
> > > > purpose is.
> > > >
> > >
> > > Yes you're right that Akka has a lot of little knobs to turn and twist
> > and
> > > some of them are more obvious than others. The transport failure
> detector
> > > is Akka's own mechanism to detect lost messages. This is necessary for
> > UDP
> > > but not for TCP since it has its own failure detector. In order to
> > decrease
> > > the unnecessary network traffic, I set the heartbeat pause and
> heartbeat
> > > interval of the transport failure detector to these high numbers.
> > >
> > >
> > > >
> > > >  - There are many different timeouts:
> > > >    -> startup timeout
> > > >
> > >
> > > That is the timeout for creating an actor system.
> > >
> > >
> > > >    -> watch heartbeat timeout
> > > >
> > >
> > > This timeout is used for the death watch. But the detector is actually
> > > controlled by akka.watch.heartbeat.interval, akka.watch.heartbeat.pause
> > and
> > > akka.watch.threshold. In [1] it is described what these parameters do.
> > >
> > >
> > > >    -> ask timeout
> > > >
> > >
> > > That is the general timeout which is used for all futures once the
> actor
> > > system has been started.
> > >
> > >
> > > >    -> TCP timeout
> > > >
> > >
> > > The TCP timeout is the timeout which is used by Netty for all outbound
> > > connections.
> > >
> > >
> > > >   How to the relate / interact? Does it make sense to define them
> > > relative
> > > > to one another?
> > >
> > >
> > > For the sake of simplicity and usability, it is a good idea to derive
> the
> > > individual timeouts by means of some heuristics from a single timeout
> > > value. Maybe we could use these heuristics as default values but still
> > > allow the user to define these values himself if he wants to.
> > >
> > >
> > > >
> > > > I think it makes a lot of sense to document these points somewhere.
> > > >
> > >
> > > I'll add an overview and details of the implementation to the internals
> > > section of the documentation.
> > >
> > >
> > > >
> > > > Greetings,
> > > > Stephan
> > > >
> > >
> > > [1]
> > >
> > >
> >
> http://doc.akka.io/docs/akka/snapshot/scala/remoting.html#watching-remote-actors
> > >
> >
>