Heartbeat lost

Heartbeat lost

Kruse, Sebastian
Hi everyone,

In some of my jobs, I occasionally encounter the problem that some of the task managers lose the heartbeat connection to the job manager. The jobmanager did not crash, though. Here is an excerpt from the dashboard:

Error: java.lang.Exception: TaskManager lost heartbeat connection to JobManager
at org.apache.flink.runtime.taskmanager.TaskManager.registerAndRunHeartbeatLoop(TaskManager.java:847)
at org.apache.flink.runtime.taskmanager.TaskManager.access$000(TaskManager.java:109)
at org.apache.flink.runtime.taskmanager.TaskManager$1.run(TaskManager.java:365)

I am not sure if this is a bug. I rather suspect that the network or jobmanager load is too high, so that the heartbeats do not arrive (on time), but that's a mere guess. A first step for me could be to increase the heartbeat interval.

Has any of you encountered this problem, or do you have any ideas on how to avoid it?

Thanks,
Sebastian

Re: Heartbeat lost

Stephan Ewen
Yes, that sounds like a good idea.

I have experienced that occasionally before, under high parallelism and
algorithms where the task manager got long garbage collection stalls...

The default timeout (30 seconds) can be aggressive for such jobs...

Stephan
On 18.11.2014 09:47, "Kruse, Sebastian" <[hidden email]> wrote:

> Hi everyone,
>
> In some of my jobs, I occasionally encounter the problem that some of the
> task managers lose the heartbeat connection to the job manager. The
> jobmanager did not crash, though. Here is an excerpt from the dashboard:
>
> Error: java.lang.Exception: TaskManager lost heartbeat connection to JobManager
> at org.apache.flink.runtime.taskmanager.TaskManager.registerAndRunHeartbeatLoop(TaskManager.java:847)
> at org.apache.flink.runtime.taskmanager.TaskManager.access$000(TaskManager.java:109)
> at org.apache.flink.runtime.taskmanager.TaskManager$1.run(TaskManager.java:365)
>
> I am not sure if this is a bug. I rather suspect that the network or
> jobmanager load is too high, so that the heartbeats do not arrive (on
> time), but that's a mere guess. A first step for me could be to increase
> the heartbeat interval.
>
> Has any of you encountered this problem, or do you have any ideas on how
> to avoid it?
>
> Thanks,
> Sebastian

RE: Heartbeat lost

Kruse, Sebastian
I am using the RemoteCollectorOutputFormat (if you recall, Fabian Tschirschnitz contributed this) to send the output data to the driver, which happens to run on the same machine as the jobmanager. In some cases this output becomes huge, and I assume this is the problem.

However, since the heartbeat runs in its own thread, we could assign it a higher priority than regular driver/jobmanager code to avoid the suppression of heartbeats. Or am I missing something?
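
As a rough sketch of that idea, assuming nothing beyond plain java.lang.Thread: the class name HeartbeatPriorityDemo and the helper newHeartbeatThread are made up for illustration and are not part of Flink. Note that a thread priority is only a hint to the scheduler, so this may reduce starvation by other threads but cannot help during a stop-the-world garbage collection pause.

```java
public class HeartbeatPriorityDemo {

    /**
     * Creates (but does not start) a thread for the given heartbeat loop and
     * asks the scheduler to prefer it over normal-priority threads.
     * Hypothetical helper, not Flink code.
     */
    static Thread newHeartbeatThread(Runnable heartbeatLoop) {
        Thread t = new Thread(heartbeatLoop, "Heartbeat Thread");
        // Priority is a scheduling hint only; the effective value is capped
        // by the maximum priority of the thread's group.
        t.setPriority(Thread.MAX_PRIORITY);
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in for the real heartbeat loop.
        Thread t = newHeartbeatThread(() -> System.out.println("heartbeat sent"));
        t.start();
        t.join();
    }
}
```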

Cheers,
Sebastian

-----Original Message-----
From: [hidden email] [mailto:[hidden email]] On Behalf Of Stephan Ewen
Sent: Tuesday, November 18, 2014 10:57
To: [hidden email]
Subject: Re: Heartbeat lost

Yes, that sounds like a good idea.

I have experienced that occasionally before, under high parallelism and algorithms where the task manager got long garbage collection stalls...

The default timeout (30 seconds) can be aggressive for such jobs...

Stephan

Re: Heartbeat lost

Stephan Ewen
The heartbeats currently go through the RPC service, which is soon to be
replaced by Akka. So any fix there would be temporary.

You can try increasing the thread priority; let us know if it works.

Otherwise you can increase the heartbeat timeout via
"jobmanager.max-heartbeat-delay-before-failure.sec". CAREFUL: The key says
seconds, but the value is in milliseconds. We actually need to fix that.
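
To make the unit mismatch concrete, here is a small sketch that turns a timeout given in seconds into the millisecond value the key expects. The class and method names are invented for the example, and the "key: value" line format is an assumption about the configuration file, not something confirmed in this thread.

```java
import java.util.concurrent.TimeUnit;

public class HeartbeatTimeoutValue {

    // Key quoted in the message above; despite the ".sec" suffix,
    // its value is interpreted as milliseconds.
    static final String KEY = "jobmanager.max-heartbeat-delay-before-failure.sec";

    /** Builds a config line for a timeout given in seconds (hypothetical helper). */
    static String configLine(long timeoutSeconds) {
        long millis = TimeUnit.SECONDS.toMillis(timeoutSeconds);
        // Assumed "key: value" layout of the configuration file.
        return KEY + ": " + millis;
    }

    public static void main(String[] args) {
        // A 2 minute timeout has to be written as 120000, not 120.
        System.out.println(configLine(120));
    }
}
```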

Stephan


On Tue, Nov 18, 2014 at 1:25 PM, Kruse, Sebastian <[hidden email]>
wrote:

> I am using the RemoteCollectorOutputFormat (if you recall, Fabian
> Tschirschnitz contributed this) to send the output data to the driver,
> which happens to run on the same machine as the jobmanager. In some cases
> this output becomes huge, and I assume this is the problem.
>
> However, since the heartbeat runs in its own thread, we could assign it a
> higher priority than regular driver/jobmanager code to avoid the
> suppression of heartbeats. Or am I missing something?
>
> Cheers,
> Sebastian

Re: Heartbeat lost

Flavio Pompermaier
In reply to this post by Stephan Ewen
Have you considered adopting Reactor instead of Akka?
On Nov 18, 2014 10:57 AM, "Stephan Ewen" <[hidden email]> wrote:

> Yes, that sounds like a good idea.
>
> I have experienced that occasionally before, under high parallelism and
> algorithms where the task manager got long garbage collection stalls...
>
> The default timeout (30 seconds) can be aggressive for such jobs...
>
> Stephan

RE: Heartbeat lost

Kruse, Sebastian
In reply to this post by Stephan Ewen
To me, it looks like "jobmanager.max-heartbeat-delay-before-failure.sec" is only used by the jobmanager to detect dead taskmanagers, but not vice versa. This is probably fine, because the parameter name starts with "jobmanager". However, the number of lost heartbeats that the taskmanager tolerates seems to be hard-wired to 3:

TaskManager, lines 335 ff.:

    // start the heart beats
    {
        final long interval = GlobalConfiguration.getInteger(
                ConfigConstants.TASK_MANAGER_HEARTBEAT_INTERVAL_KEY,
                ConfigConstants.DEFAULT_TASK_MANAGER_HEARTBEAT_INTERVAL);

        this.heartbeatThread = new Thread() {
            @Override
            public void run() {
                registerAndRunHeartbeatLoop(interval, MAX_LOST_HEART_BEATS);
            }
        };
        this.heartbeatThread.setName("Heartbeat Thread");
        this.heartbeatThread.start();
    }

Maybe we should have a "taskmanager.max-heartbeat-delay-before-failure.msec" as well.
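
A sketch of what such an option could look like, using java.util.Properties as a stand-in for the real configuration classes. The key is only the one proposed above and does not exist; the class name, the default of 3, and the rule for deriving a lost-heartbeat count from the delay and the interval are all assumptions for illustration.

```java
import java.util.Properties;

public class TaskManagerHeartbeatConfig {

    // Proposed key from the message above; not an existing option.
    static final String MAX_DELAY_KEY =
            "taskmanager.max-heartbeat-delay-before-failure.msec";

    // Assumed to mirror the limit of 3 described above.
    static final int DEFAULT_MAX_LOST_HEART_BEATS = 3;

    /**
     * Derives how many lost heartbeats to tolerate from the configured
     * maximum delay and the heartbeat interval (both in milliseconds).
     */
    static int maxLostHeartbeats(Properties conf, long intervalMillis) {
        if (intervalMillis <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        String raw = conf.getProperty(MAX_DELAY_KEY);
        if (raw == null) {
            return DEFAULT_MAX_LOST_HEART_BEATS;
        }
        long maxDelayMillis = Long.parseLong(raw.trim());
        // Round up, and always tolerate at least one lost heartbeat.
        long lost = (maxDelayMillis + intervalMillis - 1) / intervalMillis;
        return (int) Math.max(1, lost);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(MAX_DELAY_KEY, "60000");
        System.out.println(maxLostHeartbeats(conf, 5000));
    }
}
```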

-----Original Message-----
From: [hidden email] [mailto:[hidden email]] On Behalf Of Stephan Ewen
Sent: Tuesday, November 18, 2014 14:08
To: [hidden email]
Subject: Re: Heartbeat lost

The heartbeats currently go through the RPC service, which is soon to be replaced by Akka. So any fix there would be temporary.

You can try increasing the thread priority; let us know if it works.

Otherwise you can increase the heartbeat timeout via "jobmanager.max-heartbeat-delay-before-failure.sec". CAREFUL: The key says seconds, but the value is in milliseconds. We actually need to fix that.

Stephan



Re: Heartbeat lost

Stephan Ewen
The mechanisms are different here: the JobManager cares about time and
discards a TaskManager if the heartbeat was delayed long enough.

Delayed heartbeats are not a problem for the TaskManager - if the heartbeat
thread gets stuck, it gets stuck. Only seriously lost heartbeats cause a
problem, and that goes together with an IOException. The only other reason
for an unsuccessful heartbeat is that the JobManager rejected the heartbeat
because the delay has passed and the TaskManager has been marked as dead.

In that sense, the TaskManager respects the delay as well, unless network
problems occur. In that case, it fails earlier.
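
The difference between the two sides can be sketched roughly as follows. This is a simplified model, not the actual runtime code: all names are invented, the loop does not sleep between attempts, and counting consecutive IOExceptions is an assumption about how lost heartbeats are tallied.

```java
import java.io.IOException;

public class HeartbeatMechanisms {

    /** JobManager side: purely time based. */
    static boolean isTaskManagerDead(long lastHeartbeatMillis, long nowMillis, long maxDelayMillis) {
        return nowMillis - lastHeartbeatMillis > maxDelayMillis;
    }

    /** One heartbeat attempt: throws on network failure, returns false if rejected. */
    interface HeartbeatSender {
        boolean send() throws IOException;
    }

    enum Outcome { STOPPED, REJECTED, CONNECTION_LOST }

    /** TaskManager side: counts failed sends instead of measuring elapsed time. */
    static Outcome runHeartbeats(HeartbeatSender sender, int attempts, int maxLostHeartbeats) {
        int lost = 0;
        for (int i = 0; i < attempts; i++) {
            try {
                if (!sender.send()) {
                    // JobManager already marked this TaskManager as dead.
                    return Outcome.REJECTED;
                }
                lost = 0; // a successful heartbeat resets the counter (assumption)
            } catch (IOException e) {
                if (++lost >= maxLostHeartbeats) {
                    return Outcome.CONNECTION_LOST;
                }
            }
        }
        return Outcome.STOPPED;
    }

    public static void main(String[] args) {
        HeartbeatSender broken = () -> { throw new IOException("connection refused"); };
        System.out.println(runHeartbeats(broken, 10, 3));
    }
}
```

In this model a stuck heartbeat thread never produces a CONNECTION_LOST on its own; only failed sends or a rejection by the JobManager end the loop early.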

Do you actually experience these IOExceptions (in the log of the
TaskManager)?





On Wed, Nov 19, 2014 at 2:49 PM, Kruse, Sebastian <[hidden email]>
wrote:

> To me, it looks like "jobmanager.max-heartbeat-delay-before-failure.sec"
> is only used by the jobmanager to detect dead taskmanagers, but not vice
> versa. This is probably fine, because the parameter name starts with
> "jobmanager". However, the number of lost heartbeats that the taskmanager
> tolerates seems to be hard-wired to 3:
>
> TaskManager, lines 335 ff.:
>
>     // start the heart beats
>     {
>         final long interval = GlobalConfiguration.getInteger(
>                 ConfigConstants.TASK_MANAGER_HEARTBEAT_INTERVAL_KEY,
>                 ConfigConstants.DEFAULT_TASK_MANAGER_HEARTBEAT_INTERVAL);
>
>         this.heartbeatThread = new Thread() {
>             @Override
>             public void run() {
>                 registerAndRunHeartbeatLoop(interval, MAX_LOST_HEART_BEATS);
>             }
>         };
>         this.heartbeatThread.setName("Heartbeat Thread");
>         this.heartbeatThread.start();
>     }
>
> Maybe we should have a
> "taskmanager.max-heartbeat-delay-before-failure.msec" as well.