[jira] [Created] (FLINK-12437) Taskmanager doesn't initiate registration after jobmanager marks it terminated


Shang Yuanchun (Jira)
Abdul Qadeer created FLINK-12437:
------------------------------------

             Summary: Taskmanager doesn't initiate registration after jobmanager marks it terminated
                 Key: FLINK-12437
                 URL: https://issues.apache.org/jira/browse/FLINK-12437
             Project: Flink
          Issue Type: Bug
            Reporter: Abdul Qadeer


This issue is observed in standalone cluster deployment mode with ZooKeeper HA enabled, on Flink 1.4.0. A few taskmanagers restarted after running out of Metaspace.
 The offending taskmanager `pipelineruntime-taskmgr-6789dd578b-dcp4r` first registers successfully with the jobmanager, and the remote watcher marks it terminated soon after, as seen in the logs below. Other taskmanagers were terminated around the same time, but those were quarantined by the jobmanager with a message similar to:
{noformat}
Association to [akka.tcp://flink@10.60.5.121:8070] having UID [864976677] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.
{noformat}
Those taskmanagers came back up and registered successfully with the jobmanager. This didn't happen for the offending taskmanager:
  
 At JobManager:
{noformat}
{"timeMillis":1557073368155,"thread":"flink-akka.actor.default-dispatcher-49","level":"INFO","loggerName":"org.apache.flink.runtime.instance.InstanceManager","message":"Registered TaskManager at pipelineruntime-taskmgr-6789dd578b-dcp4r (akka.tcp://flink@10.60.5.85:8070/user/taskmanager) as ae61ac607f0ab35ab5066f7dc221e654. Current number of registered hosts is 8. Current number of alive task slots is 51.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":125,"threadPriority":5}
...
...
{"timeMillis":1557073391386,"thread":"flink-akka.actor.default-dispatcher-82","level":"INFO","loggerName":"org.apache.flink.runtime.instance.InstanceManager","message":"Unregistered task manager /10.60.5.85. Number of registered task managers 7. Number of available slots 45.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":159,"threadPriority":5}
...
...
{"timeMillis":1557073391483,"thread":"flink-akka.actor.default-dispatcher-82","level":"INFO","loggerName":"org.apache.flink.runtime.instance.InstanceManager","message":"Unregistered task manager /10.60.5.85. Number of registered task managers 6. Number of available slots 39.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":159,"threadPriority":5}
...
...
{"timeMillis":1557073370389,"thread":"flink-akka.actor.default-dispatcher-35","level":"INFO","loggerName":"akka.actor.LocalActorRef","message":"Message [akka.remote.ReliableDeliverySupervisor$Ungate$] from Actor[akka://flink/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fflink%4010.60.5.85%3A8070-3#1863607260] to Actor[akka://flink/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fflink%4010.60.5.85%3A8070-3#1863607260] was not delivered. [22] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":98,"threadPriority":5}
{noformat}
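The JobManager excerpt above is not listed in timestamp order: the dead-letter entry comes last but carries an earlier `timeMillis` than the two "Unregistered" entries. The following is a minimal sketch for putting such JSON log lines on a timeline. The log lines are shortened copies of the ones quoted above (only `timeMillis` and a trimmed `message` are kept), and the names `timeline` and `offsets_ms` are illustrative helpers, not part of Flink.

```python
import json

# Abbreviated copies of the JobManager log lines quoted above: only the
# fields used here (timeMillis, message) are kept, and messages are shortened.
JOBMANAGER_LOG = """\
{"timeMillis":1557073368155,"message":"Registered TaskManager at pipelineruntime-taskmgr-6789dd578b-dcp4r"}
{"timeMillis":1557073391386,"message":"Unregistered task manager /10.60.5.85."}
{"timeMillis":1557073391483,"message":"Unregistered task manager /10.60.5.85."}
{"timeMillis":1557073370389,"message":"Message [akka.remote.ReliableDeliverySupervisor$Ungate$] ... was not delivered."}
"""


def timeline(raw):
    """Parse one JSON object per line and return the events sorted by timestamp."""
    events = [json.loads(line) for line in raw.splitlines() if line.strip()]
    return sorted(events, key=lambda event: event["timeMillis"])


def offsets_ms(events):
    """Return (milliseconds since the first event, message) pairs."""
    start = events[0]["timeMillis"]
    return [(event["timeMillis"] - start, event["message"]) for event in events]


if __name__ == "__main__":
    for offset, message in offsets_ms(timeline(JOBMANAGER_LOG)):
        print(f"+{offset:>6} ms  {message}")
```

On these four lines, the dead-letter entry falls about 2.2 s after the registration, and the two unregistrations follow about 23.2 s and 23.3 s after it.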
At TaskManager:
{noformat}
{"timeMillis":1557073366068,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Starting TaskManager","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
{"timeMillis":1557073366073,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Starting TaskManager actor system at 10.60.5.85:8070.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
{"timeMillis":1557073366077,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Trying to start actor system at 10.60.5.85:8070","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
{"timeMillis":1557073366510,"thread":"flink-akka.actor.default-dispatcher-4","level":"INFO","loggerName":"akka.event.slf4j.Slf4jLogger","message":"Slf4jLogger started","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":51,"threadPriority":5}
{"timeMillis":1557073366694,"thread":"flink-akka.actor.default-dispatcher-4","level":"INFO","loggerName":"akka.remote.Remoting","message":"Starting remoting","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":51,"threadPriority":5}
{"timeMillis":1557073367049,"thread":"flink-akka.actor.default-dispatcher-4","level":"INFO","loggerName":"akka.remote.Remoting","message":"Remoting started; listening on addresses :[akka.tcp://flink@10.60.5.85:8070]","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":51,"threadPriority":5}
{"timeMillis":1557073367051,"thread":"flink-akka.actor.default-dispatcher-4","level":"INFO","loggerName":"akka.remote.Remoting","message":"Remoting now listens on addresses: [akka.tcp://flink@10.60.5.85:8070]","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":51,"threadPriority":5}
{"timeMillis":1557073367089,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Actor system started at akka.tcp://flink@10.60.5.85:8070","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
{"timeMillis":1557073367138,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.metrics.MetricRegistryImpl","message":"Configuring FlinkMetricsReporter with {class=com.cisco.ndp.pipeline.processor.flink.metrics.FlinkMetricsReporter}.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
{"timeMillis":1557073367139,"thread":"pool-2-thread-1","level":"INFO","loggerName":"com.cisco.ndp.pipeline.processor.flink.metrics.FlinkMetricsReporter","message":"Metrics Reporter Open","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
{"timeMillis":1557073367139,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.metrics.MetricRegistryImpl","message":"Reporting metrics for reporter ndp of type com.cisco.ndp.pipeline.processor.flink.metrics.FlinkMetricsReporter.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
{"timeMillis":1557073367142,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Starting TaskManager actor","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
{"timeMillis":1557073367176,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.io.network.netty.NettyConfig","message":"NettyConfig [server address: /10.60.5.85, server port: 0, ssl enabled: false, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 3 (manual), number of client threads: 3 (manual), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
{"timeMillis":1557073367187,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.taskexecutor.TaskManagerConfiguration","message":"Messages have a max timeout of 100000 ms","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
{"timeMillis":1557073367198,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.taskexecutor.TaskManagerServices","message":"Temporary file directory '/tmp': total 373 GB, usable 295 GB (79.09% usable)","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
{"timeMillis":1557073367608,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.io.network.buffer.NetworkBufferPool","message":"Allocated 639 MB for network buffer pool (number of memory segments: 20467, bytes per segment: 32768).","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
{"timeMillis":1557073367710,"thread":"pool-2-thread-1","level":"WARN","loggerName":"org.apache.flink.runtime.query.QueryableStateUtils","message":"Could not load Queryable State Client Proxy. Probable reason: flink-queryable-state-runtime is not in the classpath. Please put the corresponding jar from the opt to the lib folder.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
{"timeMillis":1557073367711,"thread":"pool-2-thread-1","level":"WARN","loggerName":"org.apache.flink.runtime.query.QueryableStateUtils","message":"Could not load Queryable State Server. Probable reason: flink-queryable-state-runtime is not in the classpath. Please put the corresponding jar from the opt to the lib folder.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
{"timeMillis":1557073367712,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.io.network.NetworkEnvironment","message":"Starting the network environment and its components.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
{"timeMillis":1557073367753,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.io.network.netty.NettyClient","message":"Successful initialization (took 34 ms).","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
{"timeMillis":1557073367805,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.io.network.netty.NettyServer","message":"Successful initialization (took 51 ms). Listening on SocketAddress /10.60.5.85:38873.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
{"timeMillis":1557073367808,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.taskexecutor.TaskManagerServices","message":"Limiting managed memory to 0.7 of the currently free heap space (4005 MB), memory will be allocated lazily.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
{"timeMillis":1557073367819,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.io.disk.iomanager.IOManager","message":"I/O manager uses directory /tmp/flink-io-5f657721-13dd-40aa-9c00-2a15d5666280 for spill files.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
{"timeMillis":1557073367826,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.filecache.FileCache","message":"User file cache uses directory /tmp/flink-dist-cache-30b1f2fd-9457-435b-a601-ae0b4e37dc6d","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
{"timeMillis":1557073367862,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.filecache.FileCache","message":"User file cache uses directory /tmp/flink-dist-cache-3dfb3cd5-b261-4df3-a662-a1cd91047c72","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
{"timeMillis":1557073367888,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Starting TaskManager actor at akka://flink/user/taskmanager#1157564383.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
{"timeMillis":1557073367889,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"TaskManager data connection information: pipelineruntime-taskmgr-6789dd578b-dcp4r-57b5f60d8144eb16425ec5bd9666768f @ pipelineruntime-taskmgr-6789dd578b-dcp4r (dataPort=38873)","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
{"timeMillis":1557073367890,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"TaskManager has 6 task slot(s).","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
{"timeMillis":1557073367892,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Memory usage stats: [HEAP: 842/6554/6554 MB, NON HEAP: 62/64/1776 MB (used/committed/max)]","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
{"timeMillis":1557073367892,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService","message":"Starting ZooKeeperLeaderRetrievalService.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
{"timeMillis":1557073367965,"thread":"pool-2-thread-1-EventThread","level":"INFO","loggerName":"org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService","message":"Leader node has changed with Leader=akka.tcp://flink@10.60.5.53:6123/user/jobmanager, session ID=270a3383-8f1e-4f2d-b1d6-f7af727e9ea0.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":46,"threadPriority":5}
{"timeMillis":1557073367966,"thread":"pool-2-thread-1-EventThread","level":"INFO","loggerName":"org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService","message":"New leader information: Leader=akka.tcp://flink@10.60.5.53:6123/user/jobmanager, session ID=270a3383-8f1e-4f2d-b1d6-f7af727e9ea0.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":46,"threadPriority":5}
{"timeMillis":1557073367975,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Trying to register at JobManager akka.tcp://flink@10.60.5.53:6123/user/jobmanager (attempt 1, timeout: 500 milliseconds)","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
{"timeMillis":1557073368168,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Successful registration at JobManager (akka.tcp://flink@10.60.5.53:6123/user/jobmanager), starting network stack and library cache.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
{"timeMillis":1557073368177,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Determined BLOB server address to be /10.60.5.53:43987. Starting BLOB cache.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
{"timeMillis":1557073368184,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.blob.PermanentBlobCache","message":"Created BLOB cache storage directory /tmp/blobStore-ffdc49ba-e86f-4240-93ad-7566c43e9b0d","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
{"timeMillis":1557073368189,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.blob.TransientBlobCache","message":"Created BLOB cache storage directory /tmp/blobStore-764277b6-6e46-4c8f-b7ee-80f746edefab","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
{"timeMillis":1557073391398,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage] from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$R4] to Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
{"timeMillis":1557073391399,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage] from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$S4] to Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [2] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
{"timeMillis":1557073391399,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage] from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$T4] to Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [3] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
{"timeMillis":1557073391400,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage] from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$U4] to Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [4] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
{"timeMillis":1557073391400,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage] from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$V4] to Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [5] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
{"timeMillis":1557073391401,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage] from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$W4] to Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [6] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
{"timeMillis":1557073391401,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage] from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$X4] to Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [7] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
{"timeMillis":1557073391474,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage] from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$Y4] to Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [8] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
{"timeMillis":1557073391475,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage] from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$Z4] to Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [9] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
{"timeMillis":1557073391477,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage] from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$04] to Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [10] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
...
...
...
{"timeMillis":1557073691534,"thread":"flink-akka.actor.default-dispatcher-2","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage] from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$sab] to Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [316] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":49,"threadPriority":5}

{noformat}
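One detail can be read directly from the TaskManager log above: the actor is started as `akka://flink/user/taskmanager#1157564383`, while the undelivered `LeaderSessionMessage`s are addressed to `akka://flink/user/taskmanager#-1883282689`. The sketch below only makes that comparison mechanical. It does not establish whether the mismatch explains the missing re-registration. The regular expression and the helper name `actor_uid` are assumptions made for illustration, and the message texts are shortened copies of the log entries.

```python
import re

# Matches the local TaskManager actor path plus its Akka UID, for example
# "akka://flink/user/taskmanager#1157564383". Illustrative pattern only.
ACTOR_REF = re.compile(r"akka://flink/user/taskmanager#(-?\d+)")

# Message texts shortened from the TaskManager log above.
STARTED = "Starting TaskManager actor at akka://flink/user/taskmanager#1157564383."
DEAD_LETTER = (
    "Message [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage] "
    "from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$R4] "
    "to Actor[akka://flink/user/taskmanager#-1883282689] was not delivered."
)


def actor_uid(message):
    """Return the UID of the local taskmanager actor named in a log message, or None."""
    match = ACTOR_REF.search(message)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    started, addressed = actor_uid(STARTED), actor_uid(DEAD_LETTER)
    print("started actor UID:     ", started)
    print("dead-letter target UID:", addressed)
    print("same actor incarnation:", started == addressed)
```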
TCP dump at taskmanager:
{noformat}
19:55:58.214944 IP 10.60.5.85.45008 > 10.60.5.53.6123: tcp 715
        0x0000:  4500 02ff 2809 4000 4006 f0ee 0a3c 0555  E...(.@.@....<.U
        0x0010:  0a3c 0535 afd0 17eb a107 10ac 0270 79da  .<.5.........py.
        0x0020:  8018 ce96 21f3 0000 0101 080a f2c0 c93f  ....!..........?
        0x0030:  b74c ec05 0000 02c7 0ac4 0512 c105 0a3d  .L.............=
        0x0040:  0a3b 616b 6b61 2e74 6370 3a2f 2f66 6c69  .;akka.tcp://fli
        0x0050:  6e6b 4031 302e 3630 2e35 2e35 333a 3631  nk@10.60.5.53:61
        0x0060:  3233 2f75 7365 722f 6a6f 626d 616e 6167  23/user/jobmanag
        0x0070:  6572 2331 3231 3433 3237 3831 3312 bf04  er#1214327813...
        0x0080:  0aba 04ac ed00 0573 7200 3f6f 7267 2e61  .......sr.?org.a
        0x0090:  7061 6368 652e 666c 696e 6b2e 7275 6e74  pache.flink.runt
        0x00a0:  696d 652e 6d65 7373 6167 6573 2e54 6173  ime.messages.Tas
        0x00b0:  6b4d 616e 6167 6572 4d65 7373 6167 6573  kManagerMessages
        0x00c0:  2448 6561 7274 6265 6174 1fb7 fffd 259b  $Heartbeat....%.
        0x00d0:  c539 0200 024c 000c 6163 6375 6d75 6c61  .9...L..accumula
        0x00e0:  746f 7273 7400 164c 7363 616c 612f 636f  torst..Lscala/co
        0x00f0:  6c6c 6563 7469 6f6e 2f53 6571 3b4c 000a  llection/Seq;L..
        0x0100:  696e 7374 616e 6365 4944 7400 2e4c 6f72  instanceIDt..Lor
        0x0110:  672f 6170 6163 6865 2f66 6c69 6e6b 2f72  g/apache/flink/r
        0x0120:  756e 7469 6d65 2f69 6e73 7461 6e63 652f  untime/instance/
        0x0130:  496e 7374 616e 6365 4944 3b78 7073 7200  InstanceID;xpsr.
        0x0140:  2473 6361 6c61 2e63 6f6c 6c65 6374 696f  $scala.collectio
        0x0150:  6e2e 6d75 7461 626c 652e 4172 7261 7942  n.mutable.ArrayB
        0x0160:  7566 6665 7215 38b0 5383 828e 7302 0003  uffer.8.S...s...
        0x0170:  4900 0b69 6e69 7469 616c 5369 7a65 4900  I..initialSizeI.
        0x0180:  0573 697a 6530 5b00 0561 7272 6179 7400  .size0[..arrayt.
        0x0190:  135b 4c6a 6176 612f 6c61 6e67 2f4f 626a  .[Ljava/lang/Obj
        0x01a0:  6563 743b 7870 0000 0010 0000 0000 7572  ect;xp........ur
        0x01b0:  0013 5b4c 6a61 7661 2e6c 616e 672e 4f62  ..[Ljava.lang.Ob
        0x01c0:  6a65 6374 3b90 ce58 9f10 7329 6c02 0000  ject;..X..s)l...
        0x01d0:  7870 0000 0010 7070 7070 7070 7070 7070  xp....pppppppppp
        0x01e0:  7070 7070 7070 7372 002c 6f72 672e 6170  ppppppsr.,org.ap
        0x01f0:  6163 6865 2e66 6c69 6e6b 2e72 756e 7469  ache.flink.runti
        0x0200:  6d65 2e69 6e73 7461 6e63 652e 496e 7374  me.instance.Inst
        0x0210:  616e 6365 4944 0000 0000 0000 0001 0200  anceID..........
        0x0220:  0078 7200 206f 7267 2e61 7061 6368 652e  .xr..org.apache.
        0x0230:  666c 696e 6b2e 7574 696c 2e41 6273 7472  flink.util.Abstr
        0x0240:  6163 7449 4400 0000 0000 0000 0102 0003  actID...........
        0x0250:  4a00 096c 6f77 6572 5061 7274 4a00 0975  J..lowerPartJ..u
        0x0260:  7070 6572 5061 7274 4c00 0874 6f53 7472  pperPartL..toStr
        0x0270:  696e 6774 0012 4c6a 6176 612f 6c61 6e67  ingt..Ljava/lang
        0x0280:  2f53 7472 696e 673b 7870 ae61 ac60 7f0a  /String;xp.a.`..
        0x0290:  b35a b506 6f7d c221 e654 7400 2061 6536  .Z..o}.!.Tt..ae6
        0x02a0:  3161 6336 3037 6630 6162 3335 6162 3530  1ac607f0ab35ab50
        0x02b0:  3636 6637 6463 3232 3165 3635 3410 0122  66f7dc221e654.."
        0x02c0:  3e0a 3c61 6b6b 612e 7463 703a 2f2f 666c  >.<akka.tcp://fl
        0x02d0:  696e 6b40 3130 2e36 302e 352e 3835 3a38  ink@10.60.5.85:8
        0x02e0:  3037 302f 7573 6572 2f74 6173 6b6d 616e  070/user/taskman
        0x02f0:  6167 6572 2331 3135 3735 3634 3338 33    ager#1157564383
19:55:58.214996 IP 10.60.5.53.6123 > 10.60.5.85.45008: tcp 0
        0x0000:  4500 0034 c1fe 4000 3f06 5ac4 0a3c 0535  E..4..@.?.Z..<.5
        0x0010:  0a3c 0555 17eb afd0 0270 79da a107 1377  .<.U.....py....w
        0x0020:  8010 ce93 1f28 0000 0101 080a b74c ff8d  .....(.......L..
        0x0030:  f2c0 c93f                                ...?
{noformat}
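The dump above shows a `TaskManagerMessages$Heartbeat` travelling from the taskmanager to the jobmanager. It carries the instance ID `ae61ac607f0ab35ab5066f7dc221e654`, which matches the registration in the JobManager log, and names `taskmanager#1157564383` as the sender. The following sketch shows one way to rebuild the payload bytes from such hex dump lines and list the readable strings, rather than reading the ASCII column by eye. It assumes the fixed-width layout seen above (eight 4-digit hex groups per line), and the helper names are hypothetical.

```python
import re

# Last four lines of the first packet in the dump above, copied verbatim.
HEXDUMP = """\
        0x02c0:  3e0a 3c61 6b6b 612e 7463 703a 2f2f 666c  >.<akka.tcp://fl
        0x02d0:  696e 6b40 3130 2e36 302e 352e 3835 3a38  ink@10.60.5.85:8
        0x02e0:  3037 302f 7573 6572 2f74 6173 6b6d 616e  070/user/taskman
        0x02f0:  6167 6572 2331 3135 3735 3634 3338 33    ager#1157564383
"""

HEX_COLUMN_WIDTH = 39  # eight 4-digit groups separated by single spaces


def dump_to_bytes(dump):
    """Rebuild the raw bytes from hex dump lines of the form '0xNNNN:  hex  ascii'."""
    payload = bytearray()
    for line in dump.splitlines():
        line = line.strip()
        if not line.startswith("0x"):
            continue  # skip packet summary lines and blank lines
        _, rest = line.split(":", 1)
        hex_column = rest.lstrip()[:HEX_COLUMN_WIDTH]
        payload += bytes.fromhex("".join(hex_column.split()))
    return bytes(payload)


def printable_strings(data, min_len=8):
    """Return runs of at least min_len printable ASCII characters found in data."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [run.decode("ascii") for run in re.findall(pattern, data)]


if __name__ == "__main__":
    for text in printable_strings(dump_to_bytes(HEXDUMP)):
        print(text)
```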
After this, the taskmanager never registers with the jobmanager again.

This run had the following akka configuration:

{noformat}
akka.watch.heartbeat.pause: 60 s
akka.ask.timeout: 100 s
{noformat}


