Hi,
I've set up Flink HA on AWS (3 TaskManagers and 2 JobManagers, each on an EC2 m4.large instance, with checkpointing to S3 enabled). My topology works fine, but after a few hours the TaskManagers get detached from the JobManager. I tried to reach the JobManager using telnet at the same time and it worked, but the TaskManager does not succeed in connecting again. It reattaches only after I restart it. I tried the following settings, but the problem persists:

akka.ask.timeout: 20 s
akka.lookup.timeout: 20 s
akka.watch.heartbeat.interval: 20 s

Please find attached a snapshot from one of the TaskManagers. Is there any setting that I need to change?

Thanks,
Deepak Jha

Hi Deepak,
could you check the logs to see whether the JobManager has been quarantined and thus cannot be connected to anymore? The logs should at least contain a hint as to why the TaskManager lost the connection initially.

Cheers,
Till

On Thu, Sep 8, 2016 at 7:08 PM, Deepak Jha <[hidden email]> wrote:
> [...]

Hi Till,
I'm getting the following messages in the JobManager log:

2016-09-09 07:46:55,093 PDT [WARN] ip-10-8-11-249 [flink-akka.actor.default-dispatcher-985] akka.remote.RemoteWatcher - *Detected unreachable: [akka.tcp://flink@10.8.4.57:6121]*
2016-09-09 07:46:55,094 PDT [INFO] ip-10-8-11-249 [flink-akka.actor.default-dispatcher-985] o.a.f.runtime.jobmanager.JobManager - Task manager akka.tcp://flink@10.8.4.57:6121/user/taskmanager terminated.
2016-09-09 07:46:55,094 PDT [INFO] ip-10-8-11-249 [flink-akka.actor.default-dispatcher-985] o.a.f.r.instance.InstanceManager - Unregistered task manager akka.tcp://flink@10.8.4.57:6121/user/taskmanager. Number of registered task managers 2. Number of available slots 4.
2016-09-09 07:46:55,096 PDT [WARN] ip-10-8-11-249 [flink-akka.actor.default-dispatcher-982] Remoting - Association to [akka.tcp://flink@10.8.4.57:6121] having UID [-1223410403] is irrecoverably failed. *UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.*
2016-09-09 07:46:55,097 PDT [INFO] ip-10-8-11-249 [flink-akka.actor.default-dispatcher-982] akka.actor.LocalActorRef - Message [akka.remote.transport.AssociationHandle$Disassociated] from Actor[akka://flink/deadLetters] to Actor[akka://flink/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fflink%4010.8.4.57%3A6121-0/endpointWriter/endpointReader-akka.tcp%3A%2F%2Fflink%4010.8.4.57%3A6121-0#393939009] was not delivered. [54] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
2016-09-09 07:46:55,098 PDT [INFO] ip-10-8-11-249 [flink-akka.actor.default-dispatcher-985] akka.actor.LocalActorRef - Message [akka.remote.transport.AssociationHandle$Disassociated] from Actor[akka://flink/deadLetters] to Actor[akka://flink/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2Fflink%4010.8.4.57%3A51291-2#1151730456] was not delivered. [55] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
2016-09-09 07:46:58,479 PDT [INFO] ip-10-8-11-249 [ForkJoinPool-3-worker-1] o.a.f.r.c.ZooKeeperCompletedCheckpointStore - Recovering checkpoints from ZooKeeper.

Hope it helps. I'm using Flink 1.0.2.

On Fri, Sep 9, 2016 at 12:34 AM, Till Rohrmann <[hidden email]> wrote:
> [...]

--
Thanks,
Deepak Jha

Hi Till,
One more thing I noticed, after looking into the following message in the TaskManager log:

2016-09-11 17:57:25,310 PDT [WARN] ip-10-6-0-15 [flink-akka.actor.default-dispatcher-31] Remoting - Tried to associate with unreachable remote address [akka.tcp://flink@10.6.22.22:50050]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: *The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted*.

So in this case the local ActorSystem of the Flink TaskManager should ideally go down, so that the service supervisor (runit) restarts the process and the TaskManager can connect to the remote system again. If that does not happen automatically, we have to monitor the logs in some way and trigger the restart ourselves. Please let me know if my understanding is wrong.

On Fri, Sep 9, 2016 at 8:01 AM, Deepak Jha <[hidden email]> wrote:
> [...]

--
Thanks,
Deepak Jha

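The log check mentioned in this message could be sketched roughly as follows. This is only an illustration, not something Flink ships: the function names are hypothetical, and the matched text is simply the reason string copied from the TaskManager warning quoted in the thread.

```python
import re
from typing import Iterable, List

# Reason text copied from the TaskManager warning quoted in this thread.
QUARANTINE_PATTERN = re.compile(r"The remote system has quarantined this system")


def find_quarantine_events(lines: Iterable[str]) -> List[str]:
    """Return every log line that reports this system as quarantined."""
    return [line.rstrip("\n") for line in lines if QUARANTINE_PATTERN.search(line)]


def needs_restart(lines: Iterable[str]) -> bool:
    """True if at least one line indicates the TaskManager was quarantined."""
    return len(find_quarantine_events(lines)) > 0
```

A watchdog could call needs_restart on newly appended TaskManager log lines and, on a hit, ask the supervisor to restart the service (with runit, something like "sv restart" followed by a site-specific service name, which is an assumption here).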
Hi Deepak,
it seems that the JobManager's DeathWatch declares the TaskManager unreachable, which automatically quarantines it. You're right that in such a case the TaskManager should shut down and be restarted so that it can reconnect to the JobManager. This is, however, not yet supported automatically.

For the moment, I'd recommend making the DeathWatch a little less aggressive via the following settings:

- akka.watch.heartbeat.interval: Heartbeat interval for Akka's DeathWatch mechanism to detect dead TaskManagers. If TaskManagers are wrongly marked dead because of lost or delayed heartbeat messages, then you should increase this value (DEFAULT: akka.ask.timeout/10).
- akka.watch.heartbeat.pause: Acceptable heartbeat pause for Akka's DeathWatch mechanism. A low value does not allow an irregular heartbeat (DEFAULT: akka.ask.timeout).
- akka.watch.threshold: Threshold for the DeathWatch failure detector. A low value is prone to false positives, whereas a high value increases the time to detect a dead TaskManager (DEFAULT: 12).

A thorough description of Akka's DeathWatch can be found in the Akka documentation.

I hope this helps you work around the problem for the moment, until we've added the automatic shutdown and restart.

Cheers,
Till

On Mon, Sep 12, 2016 at 5:55 AM, Deepak Jha <[hidden email]> wrote:
> [...]

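To make the defaults listed in this reply concrete, here is a small sketch that derives them from akka.ask.timeout. The helper names are hypothetical and the duration parser is a deliberate simplification (only "ms", "s" and "min"), not Flink's actual parsing; the formulas are just the DEFAULT values quoted in the list.

```python
import re

# Simplified unit table; an assumption for illustration only.
_UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "min": 60.0}


def parse_duration(text: str) -> float:
    """Parse strings such as '20 s' into seconds (simplified)."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(ms|s|min)\s*", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return float(value) * _UNIT_SECONDS[unit]


def deathwatch_defaults(ask_timeout: str) -> dict:
    """Defaults as quoted in the thread, derived from akka.ask.timeout."""
    timeout = parse_duration(ask_timeout)
    return {
        "akka.watch.heartbeat.interval": timeout / 10,  # seconds
        "akka.watch.heartbeat.pause": timeout,          # seconds
        "akka.watch.threshold": 12,
    }
```

With akka.ask.timeout: 20 s from the original post, this gives a default heartbeat interval of 2 s and a default pause of 20 s. The explicit akka.watch.heartbeat.interval: 20 s from that post is therefore as long as the default pause, so raising akka.watch.heartbeat.pause as well may be worth trying.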
Hi Till,
There is a way to shut down the actor system: set taskmanager.maxRegistrationDuration to a reasonable duration (e.g. 900 seconds). The default value is Inf. With this setting I noticed that the TaskManager goes down, runit restarts the service, and it connects to the JobManager again.

As I said earlier, the retries to connect to the JobManager do not work, even though telnet to the same JobManager on port 50050 works at the same time. So the retry seems to cache something that prevents it from reconnecting.

My Flink cluster is on AWS (m4.large instances); I'm not sure if anyone else has observed this behavior.

On Thursday, September 15, 2016, Till Rohrmann <[hidden email]> wrote:
> [...]
