Proposal: Refactor distributed coordination to use the Akka Actor Library

Stephan Ewen
This proposes to refactor the RPC service and the coordination between
Client, JobManager, and TaskManager to use the Akka actor library.

Even though Akka is written in Scala, it offers a Java interface and we can
use Akka completely from Java.

Below is a list of arguments for why this would help the system:


Problems with the current RPC service:
--------------------------------------------------------

  - No asynchronous calls with callbacks. This is the reason why several
parts of the runtime poll the status, introducing unnecessary latency.

  - No exception forwarding (many exceptions are simply swallowed), making
debugging and operation in flaky environments very hard

  - Limited number of handler threads. The RPC can only handle a fixed number
of concurrent requests, forcing you to maintain separate thread pools to
delegate actions to

  - No support for primitive data types (or boxed primitives) as arguments,
everything has to be a specially serializable type

  - Problematic threading model. The RPC continuously spawns and terminates
threads
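
To illustrate the first two points (callbacks instead of polling, and
forwarding of exceptions to the caller), here is a minimal sketch in plain
Java. It assumes Java 8 or later and uses only java.util.concurrent, not Akka
and not the existing RPC service; the class AsyncCallSketch and the methods
requestTaskStatus and describeStatus are hypothetical names invented for this
example.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

public class AsyncCallSketch {

    // Hypothetical stand-in for a remote status request; not an Akka API
    // and not part of the existing RPC service.
    static CompletableFuture<String> requestTaskStatus(boolean simulateFailure) {
        return CompletableFuture.supplyAsync(() -> {
            if (simulateFailure) {
                throw new IllegalStateException("task lost");
            }
            return "FINISHED";
        });
    }

    // Attaches a callback instead of polling; a failure on the "remote" side
    // is forwarded to the callback rather than being swallowed.
    static CompletableFuture<String> describeStatus(boolean simulateFailure) {
        return requestTaskStatus(simulateFailure).handle((status, error) -> {
            if (error == null) {
                return "status=" + status;
            }
            // supplyAsync wraps exceptions thrown by the supplier in a CompletionException.
            Throwable cause = (error instanceof CompletionException && error.getCause() != null)
                    ? error.getCause()
                    : error;
            return "failed: " + cause.getMessage();
        });
    }

    public static void main(String[] args) {
        System.out.println(describeStatus(false).join()); // prints "status=FINISHED"
        System.out.println(describeStatus(true).join());  // prints "failed: task lost"
    }
}
```

The caller registers what should happen on completion and is notified in
both the success and the failure case, so no status polling loop is needed.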



Benefits of switching to the Akka actor model:
-------------------------------------------------------------------------------

  - Akka solves all of the above issues out of the box

  - The supervisor model allows you to do failure detection of actors. That
provides a unified way of detecting and handling failures (missing
heartbeats, failed calls, ...)

  - Akka has tools to make stateful actors persistent and restart them on
other machines in cases of failure. That would greatly help in implementing
"master fail-over", which will become important

  - You can define many "call targets" (actors). Tasks (on TaskManagers)
can directly call their ExecutionVertex on the JobManager, rather than
calling the JobManager, creating a Runnable that looks up the execution
vertex, and so on...

  - The actor model's approach of queueing actions on an actor and running
them one after another makes the concurrency model of the state machine very
simple and robust

  - We "outsource" our own concerns about maintaining and improving that
part of the system
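
The point about queueing actions on an actor and running them one after
another can be sketched without Akka. The following is a toy model in plain
Java (assuming Java 8 or later), not Akka's implementation or API: a
single-threaded executor plays the role of the mailbox, and the names
MailboxSketch, CounterActor, tellTaskFinished, and askFinishedCount are
hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class MailboxSketch {

    /** A toy "actor": its state is touched only by the single mailbox thread. */
    static final class CounterActor {
        private final ExecutorService mailbox = Executors.newSingleThreadExecutor();
        private int finishedTasks = 0; // no lock needed: only the mailbox thread accesses it

        // Enqueues a message; it runs after all previously queued messages.
        void tellTaskFinished() {
            mailbox.execute(() -> finishedTasks++);
        }

        // Ask-style request: the reply is computed on the mailbox thread as well.
        CompletableFuture<Integer> askFinishedCount() {
            return CompletableFuture.supplyAsync(() -> finishedTasks, mailbox);
        }

        void stop() {
            mailbox.shutdown();
        }
    }

    // Several sender threads message the actor concurrently; returns the final count.
    static int runDemo(int senders, int messagesPerSender) {
        CounterActor actor = new CounterActor();
        ExecutorService senderPool = Executors.newFixedThreadPool(senders);
        for (int s = 0; s < senders; s++) {
            senderPool.execute(() -> {
                for (int i = 0; i < messagesPerSender; i++) {
                    actor.tellTaskFinished();
                }
            });
        }
        senderPool.shutdown();
        try {
            senderPool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for senders", e);
        }
        // Queued behind every earlier message, so it observes all increments.
        int count = actor.askFinishedCount().join();
        actor.stop();
        return count;
    }

    public static void main(String[] args) {
        System.out.println(runDemo(4, 1000)); // prints 4000
    }
}
```

Although four threads send messages concurrently, the counter needs no
synchronization, because every state change runs on the one mailbox thread
in arrival order.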

Greetings,
Stephan

Re: Proposal: Refactor distributed coordination to use the Akka Actor Library

Robert Metzger
I agree with using Akka for RPC. It is licensed under the Apache License 2.0
and seems to have a big community [1] and users [2] that depend on the system.

The YARN client also uses the old RPC service. Once Akka has been added to
the other parts of the system, I would like to rewrite the YARN client with
it, as a way to learn Akka.


[1] https://github.com/akka/akka/pulls
[2] http://doc.akka.io/docs/akka/2.0.4/additional/companies-using-akka.html



On Fri, Sep 5, 2014 at 1:34 PM, Stephan Ewen <[hidden email]> wrote:

> [...]

Re: Proposal: Refactor distributed coordination to use the Akka Actor Library

Kostas Tzoumas-2
+1 for refactoring using Akka; the arguments are overwhelming.


On Fri, Sep 5, 2014 at 2:04 PM, Robert Metzger <[hidden email]> wrote:

> [...]

Re: Proposal: Refactor distributed coordination to use the Akka Actor Library

Ufuk Celebi-2
+1


On Fri, Sep 5, 2014 at 2:25 PM, Kostas Tzoumas <[hidden email]> wrote:

> [...]

Re: Proposal: Refactor distributed coordination to use the Akka Actor Library

Stephan Ewen
+1


On Fri, Sep 5, 2014 at 2:53 PM, Ufuk Celebi <[hidden email]> wrote:

> [...]

Re: Proposal: Refactor distributed coordination to use the Akka Actor Library

till.rohrmann
+1


On Fri, Sep 5, 2014 at 3:04 PM, Stephan Ewen <[hidden email]> wrote:

> [...]

Re: Proposal: Refactor distributed coordination to use the Akka Actor Library

Sebastian Schelter
+1


2014-09-05 6:46 GMT-07:00 Till Rohrmann <[hidden email]>:

> [...]