(DEPRECATED) Apache Flink Mailing List archive.

No job recovery after job manager failure

Kashmar, Ali

No job recovery after job manager failure

Hi,

I’m trying to test HA on a 3-node Flink cluster (task slots = 48). I started a job with parallelism = 32 and waited a few seconds so that all nodes were doing work. I then shut down the node that had the leader job manager; by shut down I mean I powered off the virtual machine running it. I monitored the logs to see what was going on and saw that ZooKeeper had elected a new leader. I also saw a log message about recovering jobs, but nothing actually happened. Here’s the job manager log from the node that became the leader:

11:06:43,448 INFO org.apache.flink.runtime.jobmanager.JobManager - JobManager akka.tcp://flink@192.168.200.174:56023/user/jobmanager was granted leadership with leader session ID Some(16eb0d0a-2cae-473e-aa41-679a87d3669b).
11:06:45,912 INFO org.apache.flink.runtime.webmonitor.JobManagerRetriever - New leader reachable under akka.tcp://flink@192.168.200.174:56023/user/jobmanager:16eb0d0a-2cae-473e-aa41-679a87d3669b.
11:06:45,963 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at 192.168.200.174 (akka.tcp://flink@192.168.200.174:52324/user/taskmanager) as e8720b15c63d508e8dc19b19e70d4c88. Current number of registered hosts is 1. Current number of alive task slots is 16.
11:06:45,975 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at 192.168.200.175 (akka.tcp://flink@192.168.200.175:46612/user/taskmanager) as 766a7938746c2d41e817e2ceb42a9a64. Current number of registered hosts is 2. Current number of alive task slots is 32.
11:08:25,925 INFO org.apache.flink.runtime.jobmanager.JobManager - Recovering all jobs.

I waited 10 minutes after that last log and there was no change. And here’s the task-manager log from the same node:

11:06:45,914 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@192.168.200.174:56023/user/jobmanager (attempt 1, timeout: 500 milliseconds)
11:06:45,983 INFO org.apache.flink.runtime.taskmanager.TaskManager - Successful registration at JobManager (akka.tcp://flink@192.168.200.174:56023/user/jobmanager), starting network stack and library cache.
11:06:45,988 INFO org.apache.flink.runtime.io.network.netty.NettyClient - Successful initialization (took 4 ms).
11:06:45,994 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful initialization (took 6 ms). Listening on SocketAddress /192.168.200.174:39322.
11:06:45,994 INFO org.apache.flink.runtime.taskmanager.TaskManager - Determined BLOB server address to be /192.168.200.174:48746. Starting BLOB cache.
11:06:45,995 INFO org.apache.flink.runtime.blob.BlobCache - Created BLOB cache storage directory /tmp/blobStore-4d4e4cc2-c161-4df1-acea-abda2b28d39e

Is this a bug?

Thanks,
Ali
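
A quick sanity check on the numbers in the post and logs above: the new leader reports 16 slots per registered TaskManager and 32 alive slots in total, which matches the job's parallelism of 32. The sketch below only restates that arithmetic. It assumes default slot sharing (a job needs as many slots as its highest operator parallelism), and the function names are hypothetical, not Flink APIs.

```python
# Slot arithmetic taken from the post and the JobManager log.
# Assumption: default slot sharing, so required slots == job parallelism.
SLOTS_PER_TASK_MANAGER = 16  # "alive task slots is 16" after the first registration
JOB_PARALLELISM = 32         # parallelism stated in the post


def alive_slots(task_managers: int, slots_each: int = SLOTS_PER_TASK_MANAGER) -> int:
    """Total slots offered by the registered TaskManagers."""
    return task_managers * slots_each


def can_schedule(task_managers: int, parallelism: int = JOB_PARALLELISM) -> bool:
    """True if the alive slots cover the job's parallelism."""
    return alive_slots(task_managers) >= parallelism


if __name__ == "__main__":
    for n in (3, 2, 1):
        print(n, alive_slots(n), can_schedule(n))
```

With two surviving TaskManagers there are still enough slots under that assumption, so a slot shortage alone would not explain the stalled recovery.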

Ufuk Celebi-2

Re: No job recovery after job manager failure

Hey Ali,

can you send me the complete logs?

I don’t think it’s possible via the mailing list. Just send it to my private email [hidden email].

– Ufuk

> On 16 Dec 2015, at 17:26, Kashmar, Ali <[hidden email]> wrote:
>
> Hi,
>
> I’m trying to test HA on a 3-node Flink cluster (task slots = 48). I started a job with parallelism = 32 and waited a few seconds so that all nodes were doing work. I then shut down the node that had the leader job manager; by shut down I mean I powered off the virtual machine running it. I monitored the logs to see what was going on and saw that ZooKeeper had elected a new leader. I also saw a log message about recovering jobs, but nothing actually happened. Here’s the job manager log from the node that became the leader:
>
> 11:06:43,448 INFO org.apache.flink.runtime.jobmanager.JobManager - JobManager akka.tcp://flink@192.168.200.174:56023/user/jobmanager was granted leadership with leader session ID Some(16eb0d0a-2cae-473e-aa41-679a87d3669b).
> 11:06:45,912 INFO org.apache.flink.runtime.webmonitor.JobManagerRetriever - New leader reachable under akka.tcp://flink@192.168.200.174:56023/user/jobmanager:16eb0d0a-2cae-473e-aa41-679a87d3669b.
> 11:06:45,963 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at 192.168.200.174 (akka.tcp://flink@192.168.200.174:52324/user/taskmanager) as e8720b15c63d508e8dc19b19e70d4c88. Current number of registered hosts is 1. Current number of alive task slots is 16.
> 11:06:45,975 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at 192.168.200.175 (akka.tcp://flink@192.168.200.175:46612/user/taskmanager) as 766a7938746c2d41e817e2ceb42a9a64. Current number of registered hosts is 2. Current number of alive task slots is 32.
> 11:08:25,925 INFO org.apache.flink.runtime.jobmanager.JobManager - Recovering all jobs.
>
>
> I waited 10 minutes after that last log and there was no change. And here’s the task-manager log from the same node:
>
>
> 11:06:45,914 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@192.168.200.174:56023/user/jobmanager (attempt 1, timeout: 500 milliseconds)
> 11:06:45,983 INFO org.apache.flink.runtime.taskmanager.TaskManager - Successful registration at JobManager (akka.tcp://flink@192.168.200.174:56023/user/jobmanager), starting network stack and library cache.
> 11:06:45,988 INFO org.apache.flink.runtime.io.network.netty.NettyClient - Successful initialization (took 4 ms).
> 11:06:45,994 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful initialization (took 6 ms). Listening on SocketAddress /192.168.200.174:39322.
> 11:06:45,994 INFO org.apache.flink.runtime.taskmanager.TaskManager - Determined BLOB server address to be /192.168.200.174:48746. Starting BLOB cache.
> 11:06:45,995 INFO org.apache.flink.runtime.blob.BlobCache - Created BLOB cache storage directory /tmp/blobStore-4d4e4cc2-c161-4df1-acea-abda2b28d39e
>
>
> Is this a bug?
>
> Thanks,
> Ali

Ufuk Celebi-2

Re: No job recovery after job manager failure

As an update: I’m investigating this. Ali sent me the log files.

> On 16 Dec 2015, at 18:15, Ufuk Celebi <[hidden email]> wrote:
>
> Hey Ali,
>
> can you send me the complete logs?
>
> I don’t think it’s possible via the mailing list. Just send it to my private email [hidden email].
>
> – Ufuk

Ufuk Celebi-2

Re: No job recovery after job manager failure

There were two issues: 1) a local state backend was used, but the VM holding that state was lost, and 2) recovery did not log any exception.

2) has been addressed in this PR: https://github.com/apache/flink/pull/1472

– Ufuk

> On 17 Dec 2015, at 15:26, Ufuk Celebi <[hidden email]> wrote:
>
> As an update: I’m investigating this. Ali sent me the log files.
>
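
To illustrate point 1) of the resolution, here is a minimal sketch that scans a flink-conf-style text for storage paths that live on a node's local disk and would therefore disappear with the VM. Everything in it is an assumption for illustration: the helper names are hypothetical, the two key names are my recollection of the Flink configuration of that period and should be checked against the documentation for your version, and the list of "shared" URI schemes is only an example.

```python
from urllib.parse import urlparse

# Assumption: URI schemes treated as shared storage. Adjust for your setup;
# note that a file:// path on a network mount may in fact be shared.
SHARED_SCHEMES = {"hdfs", "s3", "s3a", "s3n"}

# Assumed key names (recalled from Flink docs of that era, not verified here).
STORAGE_KEYS = ("state.backend.fs.checkpointdir", "recovery.zookeeper.storageDir")


def parse_conf(text):
    """Parse simple 'key: value' lines, skipping blanks and '#' comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        conf[key.strip()] = value.strip()
    return conf


def suspect_storage_keys(conf):
    """Return storage keys that are unset or do not use a shared scheme."""
    suspects = []
    for key in STORAGE_KEYS:
        value = conf.get(key)
        if value is None or urlparse(value).scheme not in SHARED_SCHEMES:
            suspects.append(key)
    return suspects


# Hypothetical sample configuration; the paths are made up.
SAMPLE = """
recovery.mode: zookeeper
state.backend: filesystem
state.backend.fs.checkpointdir: hdfs:///flink/checkpoints
recovery.zookeeper.storageDir: file:///tmp/flink/recovery
"""

if __name__ == "__main__":
    print(suspect_storage_keys(parse_conf(SAMPLE)))
```

In the sample, only the recovery storage directory is flagged, because it points at a local path that a standby job manager on another machine could not read after the original VM is powered off.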