Steven Zhen Wu created FLINK-19401:
-------------------------------------- Summary: Job stuck in restart loop due to "Could not find registered job manager" Key: FLINK-19401 URL: https://issues.apache.org/jira/browse/FLINK-19401 Project: Flink Issue Type: Bug Components: Runtime / Coordination Affects Versions: 1.10.1 Reporter: Steven Zhen Wu Flink job sometimes got into a restart loop for many hours and can't recover until redeployed. We had some issue with Kafka that initially caused the job to restart. Below is the first of the many exceptions for "ResourceManagerException: Could not find registered job manager" error. {code} 2020-09-19 00:03:31,614 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [flink-akka.actor.default-dispatcher-35973] - Requesting new slot [SlotRequestId{171f1df017dab3a42c032abd07908b9b}] and profile ResourceP rofile{UNKNOWN} from resource manager. 2020-09-19 00:03:31,615 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [flink-akka.actor.default-dispatcher-35973] - Requesting new slot [SlotRequestId{cc7d136c4ce1f32285edd4928e3ab2e2}] and profile ResourceProfile{UNKNOWN} from resource manager. 2020-09-19 00:03:31,615 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [flink-akka.actor.default-dispatcher-35973] - Requesting new slot [SlotRequestId{024c8a48dafaf8f07c49dd4320d5cc94}] and profile ResourceProfile{UNKNOWN} from resource manager. 2020-09-19 00:03:31,615 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [flink-akka.actor.default-dispatcher-35973] - Requesting new slot [SlotRequestId{a591eda805b3081ad2767f5641d0db06}] and profile ResourceProfile{UNKNOWN} from resource manager. 2020-09-19 00:03:31,620 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [flink-akka.actor.default-dispatcher-35973] - Source: k2-csevpc -> k2-csevpcRaw -> (vhsPlaybackEvents -> Flat Map, merchImpressionsClientLog -> Flat Map) (56/640) (1b0d3dd1f19890886ff373a3f08809e8) switched from SCHEDULED to FAILED. java.util.concurrent.CompletionException: java.util.concurrent.CompletionException: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: No pooled slot available and request to ResourceManager for new slot failed at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:433) at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) at org.apache.flink.runtime.concurrent.FutureUtils.lambda$forward$21(FutureUtils.java:1065) at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) at java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:792) at java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2153) at org.apache.flink.runtime.concurrent.FutureUtils.forward(FutureUtils.java:1063) at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager.createRootSlot(SlotSharingManager.java:155) at org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateMultiTaskSlot(SchedulerImpl.java:511) at org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSharedSlot(SchedulerImpl.java:311) at org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.internalAllocateSlot(SchedulerImpl.java:160) at org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlotInternal(SchedulerImpl.java:143) at org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSlot(SchedulerImpl.java:113) at org.apache.flink.runtime.executiongraph.SlotProviderStrategy$NormalSlotProviderStrategy.allocateSlot(SlotProviderStrategy.java:115) at org.apache.flink.runtime.scheduler.DefaultExecutionSlotAllocator.lambda$allocateSlotsFor$0(DefaultExecutionSlotAllocator.java:104) at java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:995) at java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2137) at org.apache.flink.runtime.scheduler.DefaultExecutionSlotAllocator.allocateSlotsFor(DefaultExecutionSlotAllocator.java:102) at org.apache.flink.runtime.scheduler.DefaultScheduler.allocateSlots(DefaultScheduler.java:342) at org.apache.flink.runtime.scheduler.DefaultScheduler.allocateSlotsAndDeploy(DefaultScheduler.java:311) at org.apache.flink.runtime.scheduler.strategy.EagerSchedulingStrategy.allocateSlotsAndDeploy(EagerSchedulingStrategy.java:76) at org.apache.flink.runtime.scheduler.strategy.EagerSchedulingStrategy.restartTasks(EagerSchedulingStrategy.java:57) at org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$restartTasks$1(DefaultScheduler.java:268) at java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:719) at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:701) at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195) at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) at akka.actor.Actor$class.aroundReceive(Actor.scala:517) at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) at akka.actor.ActorCell.invoke(ActorCell.scala:561) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) at akka.dispatch.Mailbox.run(Mailbox.scala:225) at akka.dispatch.Mailbox.exec(Mailbox.scala:235) at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: No pooled slot available and request to ResourceManager for new slot failed at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607) at java.util.concurrent.CompletableFuture.uniApplyStage(CompletableFuture.java:628) at java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:1996) at org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.requestNewAllocatedSlot(SlotPoolImpl.java:438) at org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.requestNewAllocatedSlot(SchedulerImpl.java:236) at org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateMultiTaskSlot(SchedulerImpl.java:506) ... 39 more Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: No pooled slot available and request to ResourceManager for new slot failed at org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.slotRequestToResourceManagerFailed(SlotPoolImpl.java:360) at org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.lambda$requestSlotFromResourceManager$1(SlotPoolImpl.java:348) at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) at java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:792) at java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2153) at org.apache.flink.runtime.concurrent.FutureUtils.whenCompleteAsyncIfNotDone(FutureUtils.java:941) at org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.requestSlotFromResourceManager(SlotPoolImpl.java:342) at org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.requestNewAllocatedSlotInternal(SlotPoolImpl.java:309) at org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.requestNewAllocatedSlot(SlotPoolImpl.java:437) ... 41 more Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.resourcemanager.exceptions.ResourceManagerException: Could not find registered job manager for job 70216adbeed914b35d77717c4b7b13ea. at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607) at java.util.concurrent.CompletableFuture.uniApplyStage(CompletableFuture.java:628) at java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:1996) at org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.invokeRpc(AkkaInvocationHandler.java:214) at org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.invoke(AkkaInvocationHandler.java:129) at org.apache.flink.runtime.rpc.akka.FencedAkkaInvocationHandler.invoke(FencedAkkaInvocationHandler.java:78) at com.sun.proxy.$Proxy94.requestSlot(Unknown Source) at org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.requestSlotFromResourceManager(SlotPoolImpl.java:337) ... 43 more Caused by: org.apache.flink.runtime.resourcemanager.exceptions.ResourceManagerException: Could not find registered job manager for job 70216adbeed914b35d77717c4b7b13ea. at org.apache.flink.runtime.resourcemanager.ResourceManager.requestSlot(ResourceManager.java:443) at sun.reflect.GeneratedMethodAccessor135.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:284) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:199) ... 20 more {code} Grepped through our log. job manager registration happened once when the job was deployed a few days ago. {code} 2020-09-16 17:23:28,081 INFO com.netflix.spaas.runtime.resourcemanager.TitusResourceManager [flink-akka.actor.default-dispatcher-60] - Registered job manager [hidden email]://flink@100.118.253.133:43917/user/jobmanager_0 for job 70216adbeed914b35d77717c4b7b13ea. {code} Then there were a flurry of 9 registrations in the same milli-seconds that happened ~30 mins after the first error of " Could not find registered job manager". {code} 2020-09-19 00:33:07,827 INFO com.netflix.spaas.runtime.resourcemanager.TitusResourceManager [flink-akka.actor.default-dispatcher-35963] - Registered job manager [hidden email]://flink@100.118.253.133:43917/user/jobmanager_0 for job 70216adbeed914b35d77717c4b7b13ea. 2020-09-19 00:33:07,827 INFO com.netflix.spaas.runtime.resourcemanager.TitusResourceManager [flink-akka.actor.default-dispatcher-35963] - Registered job manager [hidden email]://flink@100.118.253.133:43917/user/jobmanager_0 for job 70216adbeed914b35d77717c4b7b13ea. 2020-09-19 00:33:07,827 INFO com.netflix.spaas.runtime.resourcemanager.TitusResourceManager [flink-akka.actor.default-dispatcher-35963] - Registered job manager [hidden email]://flink@100.118.253.133:43917/user/jobmanager_0 for job 70216adbeed914b35d77717c4b7b13ea. 2020-09-19 00:33:07,827 INFO com.netflix.spaas.runtime.resourcemanager.TitusResourceManager [flink-akka.actor.default-dispatcher-35963] - Registered job manager [hidden email]://flink@100.118.253.133:43917/user/jobmanager_0 for job 70216adbeed914b35d77717c4b7b13ea. 2020-09-19 00:33:07,827 INFO com.netflix.spaas.runtime.resourcemanager.TitusResourceManager [flink-akka.actor.default-dispatcher-35963] - Registered job manager [hidden email]://flink@100.118.253.133:43917/user/jobmanager_0 for job 70216adbeed914b35d77717c4b7b13ea. 2020-09-19 00:33:07,827 INFO com.netflix.spaas.runtime.resourcemanager.TitusResourceManager [flink-akka.actor.default-dispatcher-35963] - Registered job manager [hidden email]://flink@100.118.253.133:43917/user/jobmanager_0 for job 70216adbeed914b35d77717c4b7b13ea. 2020-09-19 00:33:07,827 INFO com.netflix.spaas.runtime.resourcemanager.TitusResourceManager [flink-akka.actor.default-dispatcher-35963] - Registered job manager [hidden email]://flink@100.118.253.133:43917/user/jobmanager_0 for job 70216adbeed914b35d77717c4b7b13ea. 2020-09-19 00:33:07,827 INFO com.netflix.spaas.runtime.resourcemanager.TitusResourceManager [flink-akka.actor.default-dispatcher-35963] - Registered job manager [hidden email]://flink@100.118.253.133:43917/user/jobmanager_0 for job 70216adbeed914b35d77717c4b7b13ea. 2020-09-19 00:33:07,827 INFO com.netflix.spaas.runtime.resourcemanager.TitusResourceManager [flink-akka.actor.default-dispatcher-35963] - Registered job manager [hidden email]://flink@100.118.253.133:43917/user/jobmanager_0 for job 70216adbeed914b35d77717c4b7b13ea. 2020-09-19 00:33:07,828 INFO org.apache.flink.runtime.jobmaster.JobMaster [flink-akka.actor.default-dispatcher-35968] - JobManager successfully registered at ResourceManager, leader id: bf239dac186bc8ba901a8702f4bb42e3. {code} I have the job manager logs with INFO level that I can share offline if needed. -- This message was sent by Atlassian Jira (v8.3.4#803005) |
Free forum by Nabble | Edit this page |