Flink HA on AWS: Network related issue

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Flink HA on AWS: Network related issue

Deepak Jha
Hi,
I've setup Flink HA on AWS ( 3 Taskmanagers and 2 Jobmanagers each are on EC2 m4.large instance with checkpoint enabled on S3 ). My topology works fine, but after few hours I do see that Taskmanagers gets detached with Jobmanager. I tried to reach Jobmanager using telnet at the same time and it worked but Taskmanager does not succeed in connecting again. It attaches only after I restart it. I tried following settings but still the problem persists.

akka.ask.timeout: 20 s
akka.lookup.timeout: 20 s
akka.watch.heartbeat.interval: 20 s

Please find attached snapshot on one of the Taskmanager. Is there any setting that I need to do ? 

--
Thanks,
Deepak Jha

Reply | Threaded
Open this post in threaded view
|

Re: Flink HA on AWS: Network related issue

Till Rohrmann
Hi Deepak,

could you check the logs whether the JobManager has been quarantined and
thus, cannot be connected to anymore? The logs should at least contain a
hint why the TaskManager lost the connection initially.

Cheers,
Till

On Thu, Sep 8, 2016 at 7:08 PM, Deepak Jha <[hidden email]> wrote:

> Hi,
> I've setup Flink HA on AWS ( 3 Taskmanagers and 2 Jobmanagers each are on
> EC2 m4.large instance with checkpoint enabled on S3 ). My topology works
> fine, but after few hours I do see that Taskmanagers gets detached with
> Jobmanager. I tried to reach Jobmanager using telnet at the same time and
> it worked but Taskmanager does not succeed in connecting again. It attaches
> only after I restart it. I tried following settings but still the problem
> persists.
>
> akka.ask.timeout: 20 s
> akka.lookup.timeout: 20 s
> akka.watch.heartbeat.interval: 20 s
>
> Please find attached snapshot on one of the Taskmanager. Is there any
> setting that I need to do ?
>
> --
> Thanks,
> Deepak Jha
>
>
Reply | Threaded
Open this post in threaded view
|

Re: Flink HA on AWS: Network related issue

Deepak Jha
Hi Till,
I'm getting following message in Jobmanager log

2016-09-09 07:46:55,093 PDT [WARN]  ip-10-8-11-249
[flink-akka.actor.default-dispatcher-985] akka.remote.RemoteWatcher - *Detected
unreachable: [akka.tcp://flink@10.8.4.57:6121
<http://flink@10.8.4.57:6121>]*
2016-09-09 07:46:55,094 PDT [INFO]  ip-10-8-11-249
[flink-akka.actor.default-dispatcher-985]
o.a.f.runtime.jobmanager.JobManager - Task manager akka.tcp://
flink@10.8.4.57:6121/user/taskmanager terminated.
2016-09-09 07:46:55,094 PDT [INFO]  ip-10-8-11-249
[flink-akka.actor.default-dispatcher-985] o.a.f.r.instance.InstanceManager
- Unregistered task manager akka.tcp://flink@10.8.4.57:6121/user/taskmanager.
Number of registered task managers 2. Number of available slots 4.
2016-09-09 07:46:55,096 PDT [WARN]  ip-10-8-11-249
[flink-akka.actor.default-dispatcher-982] Remoting - Association to
[akka.tcp://flink@10.8.4.57:6121] having UID [-1223410403] is irrecoverably
failed. *UID is now quarantined and all messages to this UID will be
delivered to dead letters. Remote actorsystem must be restarted to recover
from this situation.*
2016-09-09 07:46:55,097 PDT [INFO]  ip-10-8-11-249
[flink-akka.actor.default-dispatcher-982] akka.actor.LocalActorRef -
Message [akka.remote.transport.AssociationHandle$Disassociated] from
Actor[akka://flink/deadLetters] to
Actor[akka://flink/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fflink%4010.8.4.57%3A6121-0/endpointWriter/endpointReader-akka.tcp%3A%2F%2Fflink%4010.8.4.57%3A6121-0#393939009]
was not delivered. [54] dead letters encountered. This logging can be
turned off or adjusted with configuration settings 'akka.log-dead-letters'
and 'akka.log-dead-letters-during-shutdown'.
2016-09-09 07:46:55,098 PDT [INFO]  ip-10-8-11-249
[flink-akka.actor.default-dispatcher-985] akka.actor.LocalActorRef -
Message [akka.remote.transport.AssociationHandle$Disassociated] from
Actor[akka://flink/deadLetters] to
Actor[akka://flink/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2Fflink%4010.8.4.57%3A51291-2#1151730456]
was not delivered. [55] dead letters encountered. This logging can be
turned off or adjusted with configuration settings 'akka.log-dead-letters'
and 'akka.log-dead-letters-during-shutdown'.
2016-09-09 07:46:58,479 PDT [INFO]  ip-10-8-11-249
[ForkJoinPool-3-worker-1] o.a.f.r.c.ZooKeeperCompletedCheckpointStore -
Recovering checkpoints from ZooKeeper.

Hope it helps. I'm using Flink 1.0.2

On Fri, Sep 9, 2016 at 12:34 AM, Till Rohrmann <[hidden email]> wrote:

> Hi Deepak,
>
> could you check the logs whether the JobManager has been quarantined and
> thus, cannot be connected to anymore? The logs should at least contain a
> hint why the TaskManager lost the connection initially.
>
> Cheers,
> Till
>
> On Thu, Sep 8, 2016 at 7:08 PM, Deepak Jha <[hidden email]> wrote:
>
> > Hi,
> > I've setup Flink HA on AWS ( 3 Taskmanagers and 2 Jobmanagers each are on
> > EC2 m4.large instance with checkpoint enabled on S3 ). My topology works
> > fine, but after few hours I do see that Taskmanagers gets detached with
> > Jobmanager. I tried to reach Jobmanager using telnet at the same time and
> > it worked but Taskmanager does not succeed in connecting again. It
> attaches
> > only after I restart it. I tried following settings but still the problem
> > persists.
> >
> > akka.ask.timeout: 20 s
> > akka.lookup.timeout: 20 s
> > akka.watch.heartbeat.interval: 20 s
> >
> > Please find attached snapshot on one of the Taskmanager. Is there any
> > setting that I need to do ?
> >
> > --
> > Thanks,
> > Deepak Jha
> >
> >
>



--
Thanks,
Deepak Jha
Reply | Threaded
Open this post in threaded view
|

Re: Flink HA on AWS: Network related issue

Deepak Jha
Hi Till,
One more thing i noticed after looking into following message in
taskmanager log

2016-09-11 17:57:25,310 PDT [WARN]  ip-10-6-0-15
[flink-akka.actor.default-dispatcher-31] Remoting - Tried to associate with
unreachable remote address [akka.tcp://flink@10.6.22.22:50050]. Address is
now gated for 5000 ms, all messages to this address will be delivered to
dead letters. Reason: *The remote system has quarantined this system. No
further associations to the remote system are possible until this system is
restarted*.

So in this case ideally the local ActorSystem should go down so that
service supervisor/runit will restart the system and taskmanager will again
be able to connect to the remote system.. If it does not happen
automatically then we have to monitor logs in some way and then try to
ensure that it restarts. Ideally flink taskmanager Actor System should go
down. Please let me know if my understanding is wrong.





On Fri, Sep 9, 2016 at 8:01 AM, Deepak Jha <[hidden email]> wrote:

> Hi Till,
> I'm getting following message in Jobmanager log
>
> 2016-09-09 07:46:55,093 PDT [WARN]  ip-10-8-11-249
> [flink-akka.actor.default-dispatcher-985] akka.remote.RemoteWatcher - *Detected
> unreachable: [akka.tcp://flink@10.8.4.57:6121
> <http://flink@10.8.4.57:6121>]*
> 2016-09-09 07:46:55,094 PDT [INFO]  ip-10-8-11-249
> [flink-akka.actor.default-dispatcher-985] o.a.f.runtime.jobmanager.JobManager
> - Task manager akka.tcp://flink@10.8.4.57:6121/user/taskmanager
> terminated.
> 2016-09-09 07:46:55,094 PDT [INFO]  ip-10-8-11-249
> [flink-akka.actor.default-dispatcher-985] o.a.f.r.instance.InstanceManager
> - Unregistered task manager akka.tcp://flink@10.8.4.57:
> 6121/user/taskmanager. Number of registered task managers 2. Number of
> available slots 4.
> 2016-09-09 07:46:55,096 PDT [WARN]  ip-10-8-11-249
> [flink-akka.actor.default-dispatcher-982] Remoting - Association to
> [akka.tcp://flink@10.8.4.57:6121] having UID [-1223410403] is
> irrecoverably failed. *UID is now quarantined and all messages to this
> UID will be delivered to dead letters. Remote actorsystem must be restarted
> to recover from this situation.*
> 2016-09-09 07:46:55,097 PDT [INFO]  ip-10-8-11-249
> [flink-akka.actor.default-dispatcher-982] akka.actor.LocalActorRef -
> Message [akka.remote.transport.AssociationHandle$Disassociated] from
> Actor[akka://flink/deadLetters] to Actor[akka://flink/system/
> endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fflink%4010.8.4.57%
> 3A6121-0/endpointWriter/endpointReader-akka.tcp%3A%2F%
> 2Fflink%4010.8.4.57%3A6121-0#393939009] was not delivered. [54] dead
> letters encountered. This logging can be turned off or adjusted with
> configuration settings 'akka.log-dead-letters' and
> 'akka.log-dead-letters-during-shutdown'.
> 2016-09-09 07:46:55,098 PDT [INFO]  ip-10-8-11-249
> [flink-akka.actor.default-dispatcher-985] akka.actor.LocalActorRef -
> Message [akka.remote.transport.AssociationHandle$Disassociated] from
> Actor[akka://flink/deadLetters] to Actor[akka://flink/system/transports/
> akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%
> 2Fflink%4010.8.4.57%3A51291-2#1151730456] was not delivered. [55] dead
> letters encountered. This logging can be turned off or adjusted with
> configuration settings 'akka.log-dead-letters' and
> 'akka.log-dead-letters-during-shutdown'.
> 2016-09-09 07:46:58,479 PDT [INFO]  ip-10-8-11-249
> [ForkJoinPool-3-worker-1] o.a.f.r.c.ZooKeeperCompletedCheckpointStore -
> Recovering checkpoints from ZooKeeper.
>
> Hope it helps. I'm using Flink 1.0.2
>
> On Fri, Sep 9, 2016 at 12:34 AM, Till Rohrmann <[hidden email]>
> wrote:
>
>> Hi Deepak,
>>
>> could you check the logs whether the JobManager has been quarantined and
>> thus, cannot be connected to anymore? The logs should at least contain a
>> hint why the TaskManager lost the connection initially.
>>
>> Cheers,
>> Till
>>
>> On Thu, Sep 8, 2016 at 7:08 PM, Deepak Jha <[hidden email]> wrote:
>>
>> > Hi,
>> > I've setup Flink HA on AWS ( 3 Taskmanagers and 2 Jobmanagers each are
>> on
>> > EC2 m4.large instance with checkpoint enabled on S3 ). My topology works
>> > fine, but after few hours I do see that Taskmanagers gets detached with
>> > Jobmanager. I tried to reach Jobmanager using telnet at the same time
>> and
>> > it worked but Taskmanager does not succeed in connecting again. It
>> attaches
>> > only after I restart it. I tried following settings but still the
>> problem
>> > persists.
>> >
>> > akka.ask.timeout: 20 s
>> > akka.lookup.timeout: 20 s
>> > akka.watch.heartbeat.interval: 20 s
>> >
>> > Please find attached snapshot on one of the Taskmanager. Is there any
>> > setting that I need to do ?
>> >
>> > --
>> > Thanks,
>> > Deepak Jha
>> >
>> >
>>
>
>
>
> --
> Thanks,
> Deepak Jha
>
>


--
Thanks,
Deepak Jha
Reply | Threaded
Open this post in threaded view
|

Re: Flink HA on AWS: Network related issue

Till Rohrmann
Hi Deepak,

it seems that the JobManager's deathwatch declares the TaskManager to be
unreachable which will automatically quarantine it. You're right that in
such a case, the TaskManager should shut down and be restarted so that it
can again reconnect to the JobManager. This is, however, not yet supported
automatically.

For the moment, I'd recommend you to make the deathwatch a little bit less
aggressive via the following settings:

- akka.watch.heartbeat.interval: Heartbeat interval for Akka’s DeathWatch
mechanism to detect dead TaskManagers. If TaskManagers are wrongly marked
dead because of lost or delayed heartbeat messages, then you should
increase this value. A thorough description of Akka’s DeathWatch can be
found here (DEFAULT: akka.ask.timeout/10).
- akka.watch.heartbeat.pause: Acceptable heartbeat pause for Akka’s
DeathWatch mechanism. A low value does not allow a irregular heartbeat. A
thorough description of Akka’s DeathWatch can be found here (DEFAULT:
akka.ask.timeout).
- akka.watch.threshold: Threshold for the DeathWatch failure detector. A
low value is prone to false positives whereas a high value increases the
time to detect a dead TaskManager. A thorough description of Akka’s
DeathWatch can be found here (DEFAULT: 12).

I hope this helps you to work around the problem for the moment until we've
added the automatic shut down and restart.

Cheers,
Till

On Mon, Sep 12, 2016 at 5:55 AM, Deepak Jha <[hidden email]> wrote:

> Hi Till,
> One more thing i noticed after looking into following message in
> taskmanager log
>
> 2016-09-11 17:57:25,310 PDT [WARN]  ip-10-6-0-15
> [flink-akka.actor.default-dispatcher-31] Remoting - Tried to associate
> with
> unreachable remote address [akka.tcp://flink@10.6.22.22:50050]. Address is
> now gated for 5000 ms, all messages to this address will be delivered to
> dead letters. Reason: *The remote system has quarantined this system. No
> further associations to the remote system are possible until this system is
> restarted*.
>
> So in this case ideally the local ActorSystem should go down so that
> service supervisor/runit will restart the system and taskmanager will again
> be able to connect to the remote system.. If it does not happen
> automatically then we have to monitor logs in some way and then try to
> ensure that it restarts. Ideally flink taskmanager Actor System should go
> down. Please let me know if my understanding is wrong.
>
>
>
>
>
> On Fri, Sep 9, 2016 at 8:01 AM, Deepak Jha <[hidden email]> wrote:
>
> > Hi Till,
> > I'm getting following message in Jobmanager log
> >
> > 2016-09-09 07:46:55,093 PDT [WARN]  ip-10-8-11-249
> > [flink-akka.actor.default-dispatcher-985] akka.remote.RemoteWatcher -
> *Detected
> > unreachable: [akka.tcp://flink@10.8.4.57:6121
> > <http://flink@10.8.4.57:6121>]*
> > 2016-09-09 07:46:55,094 PDT [INFO]  ip-10-8-11-249
> > [flink-akka.actor.default-dispatcher-985] o.a.f.runtime.jobmanager.
> JobManager
> > - Task manager akka.tcp://flink@10.8.4.57:6121/user/taskmanager
> > terminated.
> > 2016-09-09 07:46:55,094 PDT [INFO]  ip-10-8-11-249
> > [flink-akka.actor.default-dispatcher-985] o.a.f.r.instance.
> InstanceManager
> > - Unregistered task manager akka.tcp://flink@10.8.4.57:
> > 6121/user/taskmanager. Number of registered task managers 2. Number of
> > available slots 4.
> > 2016-09-09 07:46:55,096 PDT [WARN]  ip-10-8-11-249
> > [flink-akka.actor.default-dispatcher-982] Remoting - Association to
> > [akka.tcp://flink@10.8.4.57:6121] having UID [-1223410403] is
> > irrecoverably failed. *UID is now quarantined and all messages to this
> > UID will be delivered to dead letters. Remote actorsystem must be
> restarted
> > to recover from this situation.*
> > 2016-09-09 07:46:55,097 PDT [INFO]  ip-10-8-11-249
> > [flink-akka.actor.default-dispatcher-982] akka.actor.LocalActorRef -
> > Message [akka.remote.transport.AssociationHandle$Disassociated] from
> > Actor[akka://flink/deadLetters] to Actor[akka://flink/system/
> > endpointManager/reliableEndpointWriter-akka.
> tcp%3A%2F%2Fflink%4010.8.4.57%
> > 3A6121-0/endpointWriter/endpointReader-akka.tcp%3A%2F%
> > 2Fflink%4010.8.4.57%3A6121-0#393939009] was not delivered. [54] dead
> > letters encountered. This logging can be turned off or adjusted with
> > configuration settings 'akka.log-dead-letters' and
> > 'akka.log-dead-letters-during-shutdown'.
> > 2016-09-09 07:46:55,098 PDT [INFO]  ip-10-8-11-249
> > [flink-akka.actor.default-dispatcher-985] akka.actor.LocalActorRef -
> > Message [akka.remote.transport.AssociationHandle$Disassociated] from
> > Actor[akka://flink/deadLetters] to Actor[akka://flink/system/transports/
> > akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%
> > 2Fflink%4010.8.4.57%3A51291-2#1151730456] was not delivered. [55] dead
> > letters encountered. This logging can be turned off or adjusted with
> > configuration settings 'akka.log-dead-letters' and
> > 'akka.log-dead-letters-during-shutdown'.
> > 2016-09-09 07:46:58,479 PDT [INFO]  ip-10-8-11-249
> > [ForkJoinPool-3-worker-1] o.a.f.r.c.ZooKeeperCompletedCheckpointStore -
> > Recovering checkpoints from ZooKeeper.
> >
> > Hope it helps. I'm using Flink 1.0.2
> >
> > On Fri, Sep 9, 2016 at 12:34 AM, Till Rohrmann <[hidden email]>
> > wrote:
> >
> >> Hi Deepak,
> >>
> >> could you check the logs whether the JobManager has been quarantined and
> >> thus, cannot be connected to anymore? The logs should at least contain a
> >> hint why the TaskManager lost the connection initially.
> >>
> >> Cheers,
> >> Till
> >>
> >> On Thu, Sep 8, 2016 at 7:08 PM, Deepak Jha <[hidden email]> wrote:
> >>
> >> > Hi,
> >> > I've setup Flink HA on AWS ( 3 Taskmanagers and 2 Jobmanagers each are
> >> on
> >> > EC2 m4.large instance with checkpoint enabled on S3 ). My topology
> works
> >> > fine, but after few hours I do see that Taskmanagers gets detached
> with
> >> > Jobmanager. I tried to reach Jobmanager using telnet at the same time
> >> and
> >> > it worked but Taskmanager does not succeed in connecting again. It
> >> attaches
> >> > only after I restart it. I tried following settings but still the
> >> problem
> >> > persists.
> >> >
> >> > akka.ask.timeout: 20 s
> >> > akka.lookup.timeout: 20 s
> >> > akka.watch.heartbeat.interval: 20 s
> >> >
> >> > Please find attached snapshot on one of the Taskmanager. Is there any
> >> > setting that I need to do ?
> >> >
> >> > --
> >> > Thanks,
> >> > Deepak Jha
> >> >
> >> >
> >>
> >
> >
> >
> > --
> > Thanks,
> > Deepak Jha
> >
> >
>
>
> --
> Thanks,
> Deepak Jha
>
Reply | Threaded
Open this post in threaded view
|

Re: Flink HA on AWS: Network related issue

Deepak Jha
Hi Till,
There is a way to shutdown actor systems by
setting  taskmanager.maxRegistrationDuration to a reasonable duration
(eg: 900 seconds). Default value sets it to Inf. In this case I noticed
that Taskmanager goes down and runit restarts the service and it gets
connected with Jobmanager.

 As I said earlier as well that retries to connect to Jobmanager does not
work even though telnet works at the same time to the same Jobmanager on
port 50050.  So retry does cache something which does not allow it to
reconnect. My flink cluster is on aws ( m4.large instances), not sure if
anyone else has observed this behavior.



On Thursday, September 15, 2016, Till Rohrmann <[hidden email]> wrote:

> Hi Deepak,
>
> it seems that the JobManager's deathwatch declares the TaskManager to be
> unreachable which will automatically quarantine it. You're right that in
> such a case, the TaskManager should shut down and be restarted so that it
> can again reconnect to the JobManager. This is, however, not yet supported
> automatically.
>
> For the moment, I'd recommend you to make the deathwatch a little bit less
> aggressive via the following settings:
>
> - akka.watch.heartbeat.interval: Heartbeat interval for Akka’s DeathWatch
> mechanism to detect dead TaskManagers. If TaskManagers are wrongly marked
> dead because of lost or delayed heartbeat messages, then you should
> increase this value. A thorough description of Akka’s DeathWatch can be
> found here (DEFAULT: akka.ask.timeout/10).
> - akka.watch.heartbeat.pause: Acceptable heartbeat pause for Akka’s
> DeathWatch mechanism. A low value does not allow a irregular heartbeat. A
> thorough description of Akka’s DeathWatch can be found here (DEFAULT:
> akka.ask.timeout).
> - akka.watch.threshold: Threshold for the DeathWatch failure detector. A
> low value is prone to false positives whereas a high value increases the
> time to detect a dead TaskManager. A thorough description of Akka’s
> DeathWatch can be found here (DEFAULT: 12).
>
> I hope this helps you to work around the problem for the moment until we've
> added the automatic shut down and restart.
>
> Cheers,
> Till
>
> On Mon, Sep 12, 2016 at 5:55 AM, Deepak Jha <[hidden email]
> <javascript:;>> wrote:
>
> > Hi Till,
> > One more thing i noticed after looking into following message in
> > taskmanager log
> >
> > 2016-09-11 17:57:25,310 PDT [WARN]  ip-10-6-0-15
> > [flink-akka.actor.default-dispatcher-31] Remoting - Tried to associate
> > with
> > unreachable remote address [akka.tcp://flink@10.6.22.22:50050]. Address
> is
> > now gated for 5000 ms, all messages to this address will be delivered to
> > dead letters. Reason: *The remote system has quarantined this system. No
> > further associations to the remote system are possible until this system
> is
> > restarted*.
> >
> > So in this case ideally the local ActorSystem should go down so that
> > service supervisor/runit will restart the system and taskmanager will
> again
> > be able to connect to the remote system.. If it does not happen
> > automatically then we have to monitor logs in some way and then try to
> > ensure that it restarts. Ideally flink taskmanager Actor System should go
> > down. Please let me know if my understanding is wrong.
> >
> >
> >
> >
> >
> > On Fri, Sep 9, 2016 at 8:01 AM, Deepak Jha <[hidden email]
> <javascript:;>> wrote:
> >
> > > Hi Till,
> > > I'm getting following message in Jobmanager log
> > >
> > > 2016-09-09 07:46:55,093 PDT [WARN]  ip-10-8-11-249
> > > [flink-akka.actor.default-dispatcher-985] akka.remote.RemoteWatcher -
> > *Detected
> > > unreachable: [akka.tcp://flink@10.8.4.57:6121
> > > <http://flink@10.8.4.57:6121>]*
> > > 2016-09-09 07:46:55,094 PDT [INFO]  ip-10-8-11-249
> > > [flink-akka.actor.default-dispatcher-985] o.a.f.runtime.jobmanager.
> > JobManager
> > > - Task manager akka.tcp://flink@10.8.4.57:6121/user/taskmanager
> > > terminated.
> > > 2016-09-09 07:46:55,094 PDT [INFO]  ip-10-8-11-249
> > > [flink-akka.actor.default-dispatcher-985] o.a.f.r.instance.
> > InstanceManager
> > > - Unregistered task manager akka.tcp://flink@10.8.4.57 <javascript:;>:
> > > 6121/user/taskmanager. Number of registered task managers 2. Number of
> > > available slots 4.
> > > 2016-09-09 07:46:55,096 PDT [WARN]  ip-10-8-11-249
> > > [flink-akka.actor.default-dispatcher-982] Remoting - Association to
> > > [akka.tcp://flink@10.8.4.57:6121] having UID [-1223410403] is
> > > irrecoverably failed. *UID is now quarantined and all messages to this
> > > UID will be delivered to dead letters. Remote actorsystem must be
> > restarted
> > > to recover from this situation.*
> > > 2016-09-09 07:46:55,097 PDT [INFO]  ip-10-8-11-249
> > > [flink-akka.actor.default-dispatcher-982] akka.actor.LocalActorRef -
> > > Message [akka.remote.transport.AssociationHandle$Disassociated] from
> > > Actor[akka://flink/deadLetters] to Actor[akka://flink/system/
> > > endpointManager/reliableEndpointWriter-akka.
> > tcp%3A%2F%2Fflink%4010.8.4.57%
> > > 3A6121-0/endpointWriter/endpointReader-akka.tcp%3A%2F%
> > > 2Fflink%4010.8.4.57%3A6121-0#393939009] was not delivered. [54] dead
> > > letters encountered. This logging can be turned off or adjusted with
> > > configuration settings 'akka.log-dead-letters' and
> > > 'akka.log-dead-letters-during-shutdown'.
> > > 2016-09-09 07:46:55,098 PDT [INFO]  ip-10-8-11-249
> > > [flink-akka.actor.default-dispatcher-985] akka.actor.LocalActorRef -
> > > Message [akka.remote.transport.AssociationHandle$Disassociated] from
> > > Actor[akka://flink/deadLetters] to Actor[akka://flink/system/
> transports/
> > > akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%
> > > 2Fflink%4010.8.4.57%3A51291-2#1151730456] was not delivered. [55] dead
> > > letters encountered. This logging can be turned off or adjusted with
> > > configuration settings 'akka.log-dead-letters' and
> > > 'akka.log-dead-letters-during-shutdown'.
> > > 2016-09-09 07:46:58,479 PDT [INFO]  ip-10-8-11-249
> > > [ForkJoinPool-3-worker-1] o.a.f.r.c.ZooKeeperCompletedCheckpointStore
> -
> > > Recovering checkpoints from ZooKeeper.
> > >
> > > Hope it helps. I'm using Flink 1.0.2
> > >
> > > On Fri, Sep 9, 2016 at 12:34 AM, Till Rohrmann <[hidden email]
> <javascript:;>>
> > > wrote:
> > >
> > >> Hi Deepak,
> > >>
> > >> could you check the logs whether the JobManager has been quarantined
> and
> > >> thus, cannot be connected to anymore? The logs should at least
> contain a
> > >> hint why the TaskManager lost the connection initially.
> > >>
> > >> Cheers,
> > >> Till
> > >>
> > >> On Thu, Sep 8, 2016 at 7:08 PM, Deepak Jha <[hidden email]
> <javascript:;>> wrote:
> > >>
> > >> > Hi,
> > >> > I've setup Flink HA on AWS ( 3 Taskmanagers and 2 Jobmanagers each
> are
> > >> on
> > >> > EC2 m4.large instance with checkpoint enabled on S3 ). My topology
> > works
> > >> > fine, but after few hours I do see that Taskmanagers gets detached
> > with
> > >> > Jobmanager. I tried to reach Jobmanager using telnet at the same
> time
> > >> and
> > >> > it worked but Taskmanager does not succeed in connecting again. It
> > >> attaches
> > >> > only after I restart it. I tried following settings but still the
> > >> problem
> > >> > persists.
> > >> >
> > >> > akka.ask.timeout: 20 s
> > >> > akka.lookup.timeout: 20 s
> > >> > akka.watch.heartbeat.interval: 20 s
> > >> >
> > >> > Please find attached snapshot on one of the Taskmanager. Is there
> any
> > >> > setting that I need to do ?
> > >> >
> > >> > --
> > >> > Thanks,
> > >> > Deepak Jha
> > >> >
> > >> >
> > >>
> > >
> > >
> > >
> > > --
> > > Thanks,
> > > Deepak Jha
> > >
> > >
> >
> >
> > --
> > Thanks,
> > Deepak Jha
> >
>


--
Sent from Gmail Mobile