Hi,
I’m trying to test HA on a 3-node Flink cluster (task slots = 48). So I started a job with parallelism = 32 and waited for a few seconds so that all nodes are doing work. I then shut down the node that had the leader job manager, and by shut down I mean I powered off the virtual machine running it. I monitored the logs to see what was going on and I saw that zookeeper has elected a new leader. I also saw a log for recovering jobs, but nothing actually happens. Here’s the job manager log from the node that became the leader: 11:06:43,448 INFO org.apache.flink.runtime.jobmanager.JobManager - JobManager akka.tcp://flink@192.168.200.174:56023/user/jobmanager was granted leadership with leader session ID Some(16eb0d0a-2cae-473e-aa41-679a87d3669b). 11:06:45,912 INFO org.apache.flink.runtime.webmonitor.JobManagerRetriever - New leader reachable under akka.tcp://flink@192.168.200.174:56023/user/jobmanager:16eb0d0a-2cae-473e-aa41-679a87d3669b. 11:06:45,963 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at 192.168.200.174 (akka.tcp://flink@192.168.200.174:52324/user/taskmanager) as e8720b15c63d508e8dc19b19e70d4c88. Current number of registered hosts is 1. Current number of alive task slots is 16. 11:06:45,975 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at 192.168.200.175 (akka.tcp://flink@192.168.200.175:46612/user/taskmanager) as 766a7938746c2d41e817e2ceb42a9a64. Current number of registered hosts is 2. Current number of alive task slots is 32. 11:08:25,925 INFO org.apache.flink.runtime.jobmanager.JobManager - Recovering all jobs. I waited 10 minutes after that last log and there was no change. And here’s the task-manager log from the same node: 11:06:45,914 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@192.168.200.174:56023/user/jobmanager (attempt 1, timeout: 500 milliseconds) 11:06:45,983 INFO org.apache.flink.runtime.taskmanager.TaskManager - Successful registration at JobManager (akka.tcp://flink@192.168.200.174:56023/user/jobmanager), starting network stack and library cache. 11:06:45,988 INFO org.apache.flink.runtime.io.network.netty.NettyClient - Successful initialization (took 4 ms). 11:06:45,994 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful initialization (took 6 ms). Listening on SocketAddress /192.168.200.174:39322. 11:06:45,994 INFO org.apache.flink.runtime.taskmanager.TaskManager - Determined BLOB server address to be /192.168.200.174:48746. Starting BLOB cache. 11:06:45,995 INFO org.apache.flink.runtime.blob.BlobCache - Created BLOB cache storage directory /tmp/blobStore-4d4e4cc2-c161-4df1-acea-abda2b28d39e Is this a bug? Thanks, Ali |
Hey Ali,
can you send me the complete logs? I don’t think it’s possible via the mailing list. Just send it to my private email [hidden email]. – Ufuk > On 16 Dec 2015, at 17:26, Kashmar, Ali <[hidden email]> wrote: > > Hi, > > I’m trying to test HA on a 3-node Flink cluster (task slots = 48). So I started a job with parallelism = 32 and waited for a few seconds so that all nodes are doing work. I then shut down the node that had the leader job manager, and by shut down I mean I powered off the virtual machine running it. I monitored the logs to see what was going on and I saw that zookeeper has elected a new leader. I also saw a log for recovering jobs, but nothing actually happens. Here’s the job manager log from the node that became the leader: > > 11:06:43,448 INFO org.apache.flink.runtime.jobmanager.JobManager - JobManager akka.tcp://flink@192.168.200.174:56023/user/jobmanager was granted leadership with leader session ID Some(16eb0d0a-2cae-473e-aa41-679a87d3669b). > 11:06:45,912 INFO org.apache.flink.runtime.webmonitor.JobManagerRetriever - New leader reachable under akka.tcp://flink@192.168.200.174:56023/user/jobmanager:16eb0d0a-2cae-473e-aa41-679a87d3669b. > 11:06:45,963 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at 192.168.200.174 (akka.tcp://flink@192.168.200.174:52324/user/taskmanager) as e8720b15c63d508e8dc19b19e70d4c88. Current number of registered hosts is 1. Current number of alive task slots is 16. > 11:06:45,975 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at 192.168.200.175 (akka.tcp://flink@192.168.200.175:46612/user/taskmanager) as 766a7938746c2d41e817e2ceb42a9a64. Current number of registered hosts is 2. Current number of alive task slots is 32. > 11:08:25,925 INFO org.apache.flink.runtime.jobmanager.JobManager - Recovering all jobs. > > > I waited 10 minutes after that last log and there was no change. And here’s the task-manager log from the same node: > > > 11:06:45,914 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@192.168.200.174:56023/user/jobmanager (attempt 1, timeout: 500 milliseconds) > 11:06:45,983 INFO org.apache.flink.runtime.taskmanager.TaskManager - Successful registration at JobManager (akka.tcp://flink@192.168.200.174:56023/user/jobmanager), starting network stack and library cache. > 11:06:45,988 INFO org.apache.flink.runtime.io.network.netty.NettyClient - Successful initialization (took 4 ms). > 11:06:45,994 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful initialization (took 6 ms). Listening on SocketAddress /192.168.200.174:39322. > 11:06:45,994 INFO org.apache.flink.runtime.taskmanager.TaskManager - Determined BLOB server address to be /192.168.200.174:48746. Starting BLOB cache. > 11:06:45,995 INFO org.apache.flink.runtime.blob.BlobCache - Created BLOB cache storage directory /tmp/blobStore-4d4e4cc2-c161-4df1-acea-abda2b28d39e > > > Is this a bug? > > Thanks, > Ali |
In reply to this post by Kashmar, Ali
Hey Ali,
can you send me the complete logs? I don’t think it’s possible via the mailing list. Just send it to my private email [hidden email]. – Ufuk > On 16 Dec 2015, at 17:26, Kashmar, Ali <[hidden email]> wrote: > > Hi, > > I’m trying to test HA on a 3-node Flink cluster (task slots = 48). So I started a job with parallelism = 32 and waited for a few seconds so that all nodes are doing work. I then shut down the node that had the leader job manager, and by shut down I mean I powered off the virtual machine running it. I monitored the logs to see what was going on and I saw that zookeeper has elected a new leader. I also saw a log for recovering jobs, but nothing actually happens. Here’s the job manager log from the node that became the leader: > > 11:06:43,448 INFO org.apache.flink.runtime.jobmanager.JobManager - JobManager akka.tcp://flink@192.168.200.174:56023/user/jobmanager was granted leadership with leader session ID Some(16eb0d0a-2cae-473e-aa41-679a87d3669b). > 11:06:45,912 INFO org.apache.flink.runtime.webmonitor.JobManagerRetriever - New leader reachable under akka.tcp://flink@192.168.200.174:56023/user/jobmanager:16eb0d0a-2cae-473e-aa41-679a87d3669b. > 11:06:45,963 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at 192.168.200.174 (akka.tcp://flink@192.168.200.174:52324/user/taskmanager) as e8720b15c63d508e8dc19b19e70d4c88. Current number of registered hosts is 1. Current number of alive task slots is 16. > 11:06:45,975 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at 192.168.200.175 (akka.tcp://flink@192.168.200.175:46612/user/taskmanager) as 766a7938746c2d41e817e2ceb42a9a64. Current number of registered hosts is 2. Current number of alive task slots is 32. > 11:08:25,925 INFO org.apache.flink.runtime.jobmanager.JobManager - Recovering all jobs. > > > I waited 10 minutes after that last log and there was no change. And here’s the task-manager log from the same node: > > > 11:06:45,914 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@192.168.200.174:56023/user/jobmanager (attempt 1, timeout: 500 milliseconds) > 11:06:45,983 INFO org.apache.flink.runtime.taskmanager.TaskManager - Successful registration at JobManager (akka.tcp://flink@192.168.200.174:56023/user/jobmanager), starting network stack and library cache. > 11:06:45,988 INFO org.apache.flink.runtime.io.network.netty.NettyClient - Successful initialization (took 4 ms). > 11:06:45,994 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful initialization (took 6 ms). Listening on SocketAddress /192.168.200.174:39322. > 11:06:45,994 INFO org.apache.flink.runtime.taskmanager.TaskManager - Determined BLOB server address to be /192.168.200.174:48746. Starting BLOB cache. > 11:06:45,995 INFO org.apache.flink.runtime.blob.BlobCache - Created BLOB cache storage directory /tmp/blobStore-4d4e4cc2-c161-4df1-acea-abda2b28d39e > > > Is this a bug? > > Thanks, > Ali |
As an update: I’m investigating this. Ali sent me the log files.
> On 16 Dec 2015, at 18:15, Ufuk Celebi <[hidden email]> wrote: > > Hey Ali, > > can you send me the complete logs? > > I don’t think it’s possible via the mailing list. Just send it to my private email [hidden email]. > > – Ufuk > >> On 16 Dec 2015, at 17:26, Kashmar, Ali <[hidden email]> wrote: >> >> Hi, >> >> I’m trying to test HA on a 3-node Flink cluster (task slots = 48). So I started a job with parallelism = 32 and waited for a few seconds so that all nodes are doing work. I then shut down the node that had the leader job manager, and by shut down I mean I powered off the virtual machine running it. I monitored the logs to see what was going on and I saw that zookeeper has elected a new leader. I also saw a log for recovering jobs, but nothing actually happens. Here’s the job manager log from the node that became the leader: >> >> 11:06:43,448 INFO org.apache.flink.runtime.jobmanager.JobManager - JobManager akka.tcp://flink@192.168.200.174:56023/user/jobmanager was granted leadership with leader session ID Some(16eb0d0a-2cae-473e-aa41-679a87d3669b). >> 11:06:45,912 INFO org.apache.flink.runtime.webmonitor.JobManagerRetriever - New leader reachable under akka.tcp://flink@192.168.200.174:56023/user/jobmanager:16eb0d0a-2cae-473e-aa41-679a87d3669b. >> 11:06:45,963 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at 192.168.200.174 (akka.tcp://flink@192.168.200.174:52324/user/taskmanager) as e8720b15c63d508e8dc19b19e70d4c88. Current number of registered hosts is 1. Current number of alive task slots is 16. >> 11:06:45,975 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at 192.168.200.175 (akka.tcp://flink@192.168.200.175:46612/user/taskmanager) as 766a7938746c2d41e817e2ceb42a9a64. Current number of registered hosts is 2. Current number of alive task slots is 32. >> 11:08:25,925 INFO org.apache.flink.runtime.jobmanager.JobManager - Recovering all jobs. >> >> >> I waited 10 minutes after that last log and there was no change. And here’s the task-manager log from the same node: >> >> >> 11:06:45,914 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@192.168.200.174:56023/user/jobmanager (attempt 1, timeout: 500 milliseconds) >> 11:06:45,983 INFO org.apache.flink.runtime.taskmanager.TaskManager - Successful registration at JobManager (akka.tcp://flink@192.168.200.174:56023/user/jobmanager), starting network stack and library cache. >> 11:06:45,988 INFO org.apache.flink.runtime.io.network.netty.NettyClient - Successful initialization (took 4 ms). >> 11:06:45,994 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful initialization (took 6 ms). Listening on SocketAddress /192.168.200.174:39322. >> 11:06:45,994 INFO org.apache.flink.runtime.taskmanager.TaskManager - Determined BLOB server address to be /192.168.200.174:48746. Starting BLOB cache. >> 11:06:45,995 INFO org.apache.flink.runtime.blob.BlobCache - Created BLOB cache storage directory /tmp/blobStore-4d4e4cc2-c161-4df1-acea-abda2b28d39e >> >> >> Is this a bug? >> >> Thanks, >> Ali > |
The issue was that 1) local state backend but loss of VM and 2) recovery did not log any Exception.
2) has been addressed in this PR: https://github.com/apache/flink/pull/1472 – Ufuk > On 17 Dec 2015, at 15:26, Ufuk Celebi <[hidden email]> wrote: > > As an update: I’m investigating this. Ali sent me the log files. > >> On 16 Dec 2015, at 18:15, Ufuk Celebi <[hidden email]> wrote: >> >> Hey Ali, >> >> can you send me the complete logs? >> >> I don’t think it’s possible via the mailing list. Just send it to my private email [hidden email]. >> >> – Ufuk >> >>> On 16 Dec 2015, at 17:26, Kashmar, Ali <[hidden email]> wrote: >>> >>> Hi, >>> >>> I’m trying to test HA on a 3-node Flink cluster (task slots = 48). So I started a job with parallelism = 32 and waited for a few seconds so that all nodes are doing work. I then shut down the node that had the leader job manager, and by shut down I mean I powered off the virtual machine running it. I monitored the logs to see what was going on and I saw that zookeeper has elected a new leader. I also saw a log for recovering jobs, but nothing actually happens. Here’s the job manager log from the node that became the leader: >>> >>> 11:06:43,448 INFO org.apache.flink.runtime.jobmanager.JobManager - JobManager akka.tcp://flink@192.168.200.174:56023/user/jobmanager was granted leadership with leader session ID Some(16eb0d0a-2cae-473e-aa41-679a87d3669b). >>> 11:06:45,912 INFO org.apache.flink.runtime.webmonitor.JobManagerRetriever - New leader reachable under akka.tcp://flink@192.168.200.174:56023/user/jobmanager:16eb0d0a-2cae-473e-aa41-679a87d3669b. >>> 11:06:45,963 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at 192.168.200.174 (akka.tcp://flink@192.168.200.174:52324/user/taskmanager) as e8720b15c63d508e8dc19b19e70d4c88. Current number of registered hosts is 1. Current number of alive task slots is 16. >>> 11:06:45,975 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at 192.168.200.175 (akka.tcp://flink@192.168.200.175:46612/user/taskmanager) as 766a7938746c2d41e817e2ceb42a9a64. Current number of registered hosts is 2. Current number of alive task slots is 32. >>> 11:08:25,925 INFO org.apache.flink.runtime.jobmanager.JobManager - Recovering all jobs. >>> >>> >>> I waited 10 minutes after that last log and there was no change. And here’s the task-manager log from the same node: >>> >>> >>> 11:06:45,914 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@192.168.200.174:56023/user/jobmanager (attempt 1, timeout: 500 milliseconds) >>> 11:06:45,983 INFO org.apache.flink.runtime.taskmanager.TaskManager - Successful registration at JobManager (akka.tcp://flink@192.168.200.174:56023/user/jobmanager), starting network stack and library cache. >>> 11:06:45,988 INFO org.apache.flink.runtime.io.network.netty.NettyClient - Successful initialization (took 4 ms). >>> 11:06:45,994 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful initialization (took 6 ms). Listening on SocketAddress /192.168.200.174:39322. >>> 11:06:45,994 INFO org.apache.flink.runtime.taskmanager.TaskManager - Determined BLOB server address to be /192.168.200.174:48746. Starting BLOB cache. >>> 11:06:45,995 INFO org.apache.flink.runtime.blob.BlobCache - Created BLOB cache storage directory /tmp/blobStore-4d4e4cc2-c161-4df1-acea-abda2b28d39e >>> >>> >>> Is this a bug? >>> >>> Thanks, >>> Ali >> > |
Free forum by Nabble | Edit this page |