Hi!
Since the new distributed infrastructure is built on Akka, some internal concepts have changed now. I think that this is currently not really document anywhere @Till Can you elaborate on the questions here: - What is the Akka URL in the global configuration ("jobmanager.akka.url") From the perspective of the global configuration, don't we simply have the address and port of the actor system? - We currently have multiple competing failure-detection mechanisms: For one, the job manager actor watches the task manager actors. Also, we still have the manual heart beats in place. Shouldn't we remove the old manual heartbeats and have the instance manager watch the task manager actors? - There are transport heartbeats and watch heartbeats. I could not find a good explanation of what the transport heartbeats are. Also, the heartbeat interval is very large (1000 s) by default, so I am wondering what there purpose is. - There are many different timeouts: -> startup timeout -> watch heartbeat timeout -> ask timeout -> TCP timeout How to the relate / interact? Does it make sense to define them relative to one another? I think it makes a lot of sense to document these points somewhere. Greetings, Stephan |
Hi,
you are right, the new implementation still lacks a lot of documentation which makes understanding the code harder than necessary. On Sun, Jan 4, 2015 at 10:28 PM, Stephan Ewen <[hidden email]> wrote: > Hi! > > Since the new distributed infrastructure is built on Akka, some internal > concepts have changed now. > I think that this is currently not really document anywhere > > @Till Can you elaborate on the questions here: > > - What is the Akka URL in the global configuration ("jobmanager.akka.url") > From the perspective of the global configuration, don't we simply have the > address and port of the actor system? > The jobmanager.akka.url is used to overwrite the default akka url generation which is akka.tcp://${HOSTNAME}:${PORT}. This is necessary in cases where we do not have remote actor systems but a single local, as in the case of local execution, and thus have to use a different url scheme. In case of a single actor system, the url would be akka://${ACTORSYSTEMNAME}. So in fact this configuration option is only used internally and should not be configured by the user. To make it fail-safe we should probably use a non exposed mechanism. > > - We currently have multiple competing failure-detection mechanisms: For > one, the job manager actor watches the task manager actors. Also, we still > have the manual heart beats in place. Shouldn't we remove the old manual > heartbeats and have the instance manager watch the task manager actors? > It's right that we still have the old heartbeats in place but they are stripped down. Currently, they are only used to update the lastReceivedHeartBeat field in the Instance object. Consequently, they could be simply removed at the price of not getting shown the time since the last heartbeat in the web interface. The failure detection mechanism is currently realized exclusively by using Akka's death watch, meaning that the JobManager watches the TaskManagers and vice versa. I also thought that some people wanted to piggy back on the heartbeat message to do monitoring. Therefore I kept it for the moment. But I guess that a dedicated monitoring message would be better. > - There are transport heartbeats and watch heartbeats. I could not find a > good explanation of what the transport heartbeats are. Also, the heartbeat > interval is very large (1000 s) by default, so I am wondering what there > purpose is. > Yes you're right that Akka has a lot of little knobs to turn and twist and some of them are more obvious than others. The transport failure detector is Akka's own mechanism to detect lost messages. This is necessary for UDP but not for TCP since it has its own failure detector. In order to decrease the unnecessary network traffic, I set the heartbeat pause and heartbeat interval of the transport failure detector to these high numbers. > > - There are many different timeouts: > -> startup timeout > That is the timeout for creating an actor system. > -> watch heartbeat timeout > This timeout is used for the death watch. But the detector is actually controlled by akka.watch.heartbeat.interval, akka.watch.heartbeat.pause and akka.watch.threshold. In [1] it is described what these parameters do. > -> ask timeout > That is the general timeout which is used for all futures once the actor system has been started. > -> TCP timeout > The TCP timeout is the timeout which is used by Netty for all outbound connections. > How to the relate / interact? Does it make sense to define them relative > to one another? For the sake of simplicity and usability, it is a good idea to derive the individual timeouts by means of some heuristics from a single timeout value. Maybe we could use these heuristics as default values but still allow the user to define these values himself if he wants to. > > I think it makes a lot of sense to document these points somewhere. > I'll add an overview and details of the implementation to the internals section of the documentation. > > Greetings, > Stephan > [1] http://doc.akka.io/docs/akka/snapshot/scala/remoting.html#watching-remote-actors |
Hi!
Thanks for clarifying. Here are some thoughts: 1) The akka URL should go through a non-exposes mechanism, true. In fact, using the global configuration for the local embedded mode at all seems to be a bad design that we should get rid of. 2) Okay, so we keep our own hearbeats in place as a means for metric reports. At some point, we can avoid having the JobManager actor watch the TaskManager actor then, it seems. 3) re: transport failure detector - makes sense 4) Yes, let's have a single timeout value that defines the ask timeout, tcp timeout, and the interval of the watch failure detector, and allow to override them by specifying the options. Stephan On Mon, Jan 5, 2015 at 10:30 AM, Till Rohrmann <[hidden email]> wrote: > Hi, > > you are right, the new implementation still lacks a lot of documentation > which makes understanding the code harder than necessary. > > On Sun, Jan 4, 2015 at 10:28 PM, Stephan Ewen <[hidden email]> wrote: > > > Hi! > > > > Since the new distributed infrastructure is built on Akka, some internal > > concepts have changed now. > > I think that this is currently not really document anywhere > > > > @Till Can you elaborate on the questions here: > > > > - What is the Akka URL in the global configuration > ("jobmanager.akka.url") > > From the perspective of the global configuration, don't we simply have > the > > address and port of the actor system? > > > > The jobmanager.akka.url is used to overwrite the default akka url > generation which is akka.tcp://${HOSTNAME}:${PORT}. This is necessary in > cases where we do not have remote actor systems but a single local, as in > the case of local execution, and thus have to use a different url scheme. > In case of a single actor system, the url would be > akka://${ACTORSYSTEMNAME}. So in fact this configuration option is only > used internally and should not be configured by the user. To make it > fail-safe we should probably use a non exposed mechanism. > > > > > > - We currently have multiple competing failure-detection mechanisms: For > > one, the job manager actor watches the task manager actors. Also, we > still > > have the manual heart beats in place. Shouldn't we remove the old manual > > heartbeats and have the instance manager watch the task manager actors? > > > > It's right that we still have the old heartbeats in place but they are > stripped down. Currently, they are only used to update the > lastReceivedHeartBeat field in the Instance object. Consequently, they > could be simply removed at the price of not getting shown the time since > the last heartbeat in the web interface. The failure detection mechanism is > currently realized exclusively by using Akka's death watch, meaning that > the JobManager watches the TaskManagers and vice versa. I also thought that > some people wanted to piggy back on the heartbeat message to do monitoring. > Therefore I kept it for the moment. But I guess that a dedicated monitoring > message would be better. > > > > - There are transport heartbeats and watch heartbeats. I could not find > a > > good explanation of what the transport heartbeats are. Also, the > heartbeat > > interval is very large (1000 s) by default, so I am wondering what there > > purpose is. > > > > Yes you're right that Akka has a lot of little knobs to turn and twist and > some of them are more obvious than others. The transport failure detector > is Akka's own mechanism to detect lost messages. This is necessary for UDP > but not for TCP since it has its own failure detector. In order to decrease > the unnecessary network traffic, I set the heartbeat pause and heartbeat > interval of the transport failure detector to these high numbers. > > > > > > - There are many different timeouts: > > -> startup timeout > > > > That is the timeout for creating an actor system. > > > > -> watch heartbeat timeout > > > > This timeout is used for the death watch. But the detector is actually > controlled by akka.watch.heartbeat.interval, akka.watch.heartbeat.pause and > akka.watch.threshold. In [1] it is described what these parameters do. > > > > -> ask timeout > > > > That is the general timeout which is used for all futures once the actor > system has been started. > > > > -> TCP timeout > > > > The TCP timeout is the timeout which is used by Netty for all outbound > connections. > > > > How to the relate / interact? Does it make sense to define them > relative > > to one another? > > > For the sake of simplicity and usability, it is a good idea to derive the > individual timeouts by means of some heuristics from a single timeout > value. Maybe we could use these heuristics as default values but still > allow the user to define these values himself if he wants to. > > > > > > I think it makes a lot of sense to document these points somewhere. > > > > I'll add an overview and details of the implementation to the internals > section of the documentation. > > > > > > Greetings, > > Stephan > > > > [1] > > http://doc.akka.io/docs/akka/snapshot/scala/remoting.html#watching-remote-actors > |
I addressed the issues.
On Mon, Jan 5, 2015 at 3:16 PM, Stephan Ewen <[hidden email]> wrote: > Hi! > > Thanks for clarifying. Here are some thoughts: > > 1) The akka URL should go through a non-exposes mechanism, true. In fact, > using the global configuration for the local embedded mode at all seems to > be a bad design that we should get rid of. Sooner or later we should also get rid of the global configuration. > > 2) Okay, so we keep our own hearbeats in place as a means for metric > reports. At some point, we can avoid having the JobManager actor watch the > TaskManager actor then, it seems. > That is right. We can choose depending which one performs better. > > 3) re: transport failure detector - makes sense > > 4) Yes, let's have a single timeout value that defines the ask timeout, tcp > timeout, and the interval of the watch failure detector, and allow to > override them by specifying the options. > I chose the following heuristics for the moment: ask timeout = tcp timeout = startup timeout = death watch pause = 10 * interval of death watch We can see how they behave and if necessary adapt. > > Stephan > > > On Mon, Jan 5, 2015 at 10:30 AM, Till Rohrmann <[hidden email]> > wrote: > > > Hi, > > > > you are right, the new implementation still lacks a lot of documentation > > which makes understanding the code harder than necessary. > > > > On Sun, Jan 4, 2015 at 10:28 PM, Stephan Ewen <[hidden email]> wrote: > > > > > Hi! > > > > > > Since the new distributed infrastructure is built on Akka, some > internal > > > concepts have changed now. > > > I think that this is currently not really document anywhere > > > > > > @Till Can you elaborate on the questions here: > > > > > > - What is the Akka URL in the global configuration > > ("jobmanager.akka.url") > > > From the perspective of the global configuration, don't we simply have > > the > > > address and port of the actor system? > > > > > > > The jobmanager.akka.url is used to overwrite the default akka url > > generation which is akka.tcp://${HOSTNAME}:${PORT}. This is necessary in > > cases where we do not have remote actor systems but a single local, as in > > the case of local execution, and thus have to use a different url scheme. > > In case of a single actor system, the url would be > > akka://${ACTORSYSTEMNAME}. So in fact this configuration option is only > > used internally and should not be configured by the user. To make it > > fail-safe we should probably use a non exposed mechanism. > > > > > > > > > > - We currently have multiple competing failure-detection mechanisms: > For > > > one, the job manager actor watches the task manager actors. Also, we > > still > > > have the manual heart beats in place. Shouldn't we remove the old > manual > > > heartbeats and have the instance manager watch the task manager actors? > > > > > > > It's right that we still have the old heartbeats in place but they are > > stripped down. Currently, they are only used to update the > > lastReceivedHeartBeat field in the Instance object. Consequently, they > > could be simply removed at the price of not getting shown the time since > > the last heartbeat in the web interface. The failure detection mechanism > is > > currently realized exclusively by using Akka's death watch, meaning that > > the JobManager watches the TaskManagers and vice versa. I also thought > that > > some people wanted to piggy back on the heartbeat message to do > monitoring. > > Therefore I kept it for the moment. But I guess that a dedicated > monitoring > > message would be better. > > > > > > > - There are transport heartbeats and watch heartbeats. I could not > find > > a > > > good explanation of what the transport heartbeats are. Also, the > > heartbeat > > > interval is very large (1000 s) by default, so I am wondering what > there > > > purpose is. > > > > > > > Yes you're right that Akka has a lot of little knobs to turn and twist > and > > some of them are more obvious than others. The transport failure detector > > is Akka's own mechanism to detect lost messages. This is necessary for > UDP > > but not for TCP since it has its own failure detector. In order to > decrease > > the unnecessary network traffic, I set the heartbeat pause and heartbeat > > interval of the transport failure detector to these high numbers. > > > > > > > > > > - There are many different timeouts: > > > -> startup timeout > > > > > > > That is the timeout for creating an actor system. > > > > > > > -> watch heartbeat timeout > > > > > > > This timeout is used for the death watch. But the detector is actually > > controlled by akka.watch.heartbeat.interval, akka.watch.heartbeat.pause > and > > akka.watch.threshold. In [1] it is described what these parameters do. > > > > > > > -> ask timeout > > > > > > > That is the general timeout which is used for all futures once the actor > > system has been started. > > > > > > > -> TCP timeout > > > > > > > The TCP timeout is the timeout which is used by Netty for all outbound > > connections. > > > > > > > How to the relate / interact? Does it make sense to define them > > relative > > > to one another? > > > > > > For the sake of simplicity and usability, it is a good idea to derive the > > individual timeouts by means of some heuristics from a single timeout > > value. Maybe we could use these heuristics as default values but still > > allow the user to define these values himself if he wants to. > > > > > > > > > > I think it makes a lot of sense to document these points somewhere. > > > > > > > I'll add an overview and details of the implementation to the internals > > section of the documentation. > > > > > > > > > > Greetings, > > > Stephan > > > > > > > [1] > > > > > http://doc.akka.io/docs/akka/snapshot/scala/remoting.html#watching-remote-actors > > > |
Sounds good!
On Mon, Jan 5, 2015 at 7:14 PM, Till Rohrmann <[hidden email]> wrote: > I addressed the issues. > > On Mon, Jan 5, 2015 at 3:16 PM, Stephan Ewen <[hidden email]> wrote: > > > Hi! > > > > Thanks for clarifying. Here are some thoughts: > > > > 1) The akka URL should go through a non-exposes mechanism, true. In fact, > > using the global configuration for the local embedded mode at all seems > to > > be a bad design that we should get rid of. > > > Sooner or later we should also get rid of the global configuration. > > > > > > 2) Okay, so we keep our own hearbeats in place as a means for metric > > reports. At some point, we can avoid having the JobManager actor watch > the > > TaskManager actor then, it seems. > > > > That is right. We can choose depending which one performs better. > > > > > > 3) re: transport failure detector - makes sense > > > > 4) Yes, let's have a single timeout value that defines the ask timeout, > tcp > > timeout, and the interval of the watch failure detector, and allow to > > override them by specifying the options. > > > > I chose the following heuristics for the moment: > > ask timeout = tcp timeout = startup timeout = death watch pause = 10 * > interval of death watch > > We can see how they behave and if necessary adapt. > > > > > > Stephan > > > > > > On Mon, Jan 5, 2015 at 10:30 AM, Till Rohrmann <[hidden email]> > > wrote: > > > > > Hi, > > > > > > you are right, the new implementation still lacks a lot of > documentation > > > which makes understanding the code harder than necessary. > > > > > > On Sun, Jan 4, 2015 at 10:28 PM, Stephan Ewen <[hidden email]> > wrote: > > > > > > > Hi! > > > > > > > > Since the new distributed infrastructure is built on Akka, some > > internal > > > > concepts have changed now. > > > > I think that this is currently not really document anywhere > > > > > > > > @Till Can you elaborate on the questions here: > > > > > > > > - What is the Akka URL in the global configuration > > > ("jobmanager.akka.url") > > > > From the perspective of the global configuration, don't we simply > have > > > the > > > > address and port of the actor system? > > > > > > > > > > The jobmanager.akka.url is used to overwrite the default akka url > > > generation which is akka.tcp://${HOSTNAME}:${PORT}. This is necessary > in > > > cases where we do not have remote actor systems but a single local, as > in > > > the case of local execution, and thus have to use a different url > scheme. > > > In case of a single actor system, the url would be > > > akka://${ACTORSYSTEMNAME}. So in fact this configuration option is only > > > used internally and should not be configured by the user. To make it > > > fail-safe we should probably use a non exposed mechanism. > > > > > > > > > > > > > > - We currently have multiple competing failure-detection mechanisms: > > For > > > > one, the job manager actor watches the task manager actors. Also, we > > > still > > > > have the manual heart beats in place. Shouldn't we remove the old > > manual > > > > heartbeats and have the instance manager watch the task manager > actors? > > > > > > > > > > It's right that we still have the old heartbeats in place but they are > > > stripped down. Currently, they are only used to update the > > > lastReceivedHeartBeat field in the Instance object. Consequently, they > > > could be simply removed at the price of not getting shown the time > since > > > the last heartbeat in the web interface. The failure detection > mechanism > > is > > > currently realized exclusively by using Akka's death watch, meaning > that > > > the JobManager watches the TaskManagers and vice versa. I also thought > > that > > > some people wanted to piggy back on the heartbeat message to do > > monitoring. > > > Therefore I kept it for the moment. But I guess that a dedicated > > monitoring > > > message would be better. > > > > > > > > > > - There are transport heartbeats and watch heartbeats. I could not > > find > > > a > > > > good explanation of what the transport heartbeats are. Also, the > > > heartbeat > > > > interval is very large (1000 s) by default, so I am wondering what > > there > > > > purpose is. > > > > > > > > > > Yes you're right that Akka has a lot of little knobs to turn and twist > > and > > > some of them are more obvious than others. The transport failure > detector > > > is Akka's own mechanism to detect lost messages. This is necessary for > > UDP > > > but not for TCP since it has its own failure detector. In order to > > decrease > > > the unnecessary network traffic, I set the heartbeat pause and > heartbeat > > > interval of the transport failure detector to these high numbers. > > > > > > > > > > > > > > - There are many different timeouts: > > > > -> startup timeout > > > > > > > > > > That is the timeout for creating an actor system. > > > > > > > > > > -> watch heartbeat timeout > > > > > > > > > > This timeout is used for the death watch. But the detector is actually > > > controlled by akka.watch.heartbeat.interval, akka.watch.heartbeat.pause > > and > > > akka.watch.threshold. In [1] it is described what these parameters do. > > > > > > > > > > -> ask timeout > > > > > > > > > > That is the general timeout which is used for all futures once the > actor > > > system has been started. > > > > > > > > > > -> TCP timeout > > > > > > > > > > The TCP timeout is the timeout which is used by Netty for all outbound > > > connections. > > > > > > > > > > How to the relate / interact? Does it make sense to define them > > > relative > > > > to one another? > > > > > > > > > For the sake of simplicity and usability, it is a good idea to derive > the > > > individual timeouts by means of some heuristics from a single timeout > > > value. Maybe we could use these heuristics as default values but still > > > allow the user to define these values himself if he wants to. > > > > > > > > > > > > > > I think it makes a lot of sense to document these points somewhere. > > > > > > > > > > I'll add an overview and details of the implementation to the internals > > > section of the documentation. > > > > > > > > > > > > > > Greetings, > > > > Stephan > > > > > > > > > > [1] > > > > > > > > > http://doc.akka.io/docs/akka/snapshot/scala/remoting.html#watching-remote-actors > > > > > > |
Free forum by Nabble | Edit this page |