[jira] [Created] (FLINK-16470) Network failure causes Checkpoint Coordinator to flood disk with exceptions

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

[jira] [Created] (FLINK-16470) Network failure causes Checkpoint Coordinator to flood disk with exceptions

Shang Yuanchun (Jira)
Jason Kania created FLINK-16470:
-----------------------------------

             Summary: Network failure causes Checkpoint Coordinator to flood disk with exceptions
                 Key: FLINK-16470
                 URL: https://issues.apache.org/jira/browse/FLINK-16470
             Project: Flink
          Issue Type: Bug
          Components: API / Core
    Affects Versions: 1.9.2
         Environment: Latest patch current Ubuntu release with latest java 8 JRE.
            Reporter: Jason Kania


When a networking error occurred that prevented access to the shared folder mounted over NFS, the CheckpointCoordinator flooded the logs with the following:

 

{{org.apache.flink.util.FlinkException: Could not retrieve checkpoint 158365 from state handle under /0000000000000158365. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.}}
{{ at org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore.retrieveCompletedCheckpoint(ZooKeeperCompletedCheckpointStore.java:345)}}
{{ at org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore.recover(ZooKeeperCompletedCheckpointStore.java:175)}}
{{ at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreLatestCheckpointedState(CheckpointCoordinator.java:1014)}}
{{ at org.apache.flink.runtime.executiongraph.failover.AdaptedRestartPipelinedRegionStrategyNG.resetTasks(AdaptedRestartPipelinedRegionStrategyNG.java:205)}}
{{ at org.apache.flink.runtime.executiongraph.failover.AdaptedRestartPipelinedRegionStrategyNG.lambda$createResetAndRescheduleTasksCallback$1(AdaptedRestartPipelinedRegionStrategyNG.java:149)}}
{{ at org.apache.flink.runtime.concurrent.FutureUtils.lambda$scheduleWithDelay$3(FutureUtils.java:202)}}
{{ at org.apache.flink.runtime.concurrent.FutureUtils.lambda$scheduleWithDelay$4(FutureUtils.java:226)}}
{{ at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)}}
{{ at java.util.concurrent.FutureTask.run(FutureTask.java:266)}}
{{ at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)}}
{{ at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)}}
{{ at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)}}
{{ at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)}}
{{ at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)}}
{{ at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)}}
{{ at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)}}
{{ at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)}}
{{ at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)}}
{{ at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)}}
{{ at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)}}
{{ at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)}}
{{ at akka.actor.Actor.aroundReceive(Actor.scala:517)}}
{{ at akka.actor.Actor.aroundReceive$(Actor.scala:515)}}
{{ at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)}}
{{ at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)}}
{{ at akka.actor.ActorCell.invoke(ActorCell.scala:561)}}
{{ at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)}}
{{ at akka.dispatch.Mailbox.run(Mailbox.scala:225)}}
{{ at akka.dispatch.Mailbox.exec(Mailbox.scala:235)}}
{{ at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)}}
{{ at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)}}
{{ at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)}}
{{ at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)}}
{{Caused by: java.io.FileNotFoundException: /mnt/shared/completedCheckpoint53ed9d9197f7 (No such file or directory)}}
{{ at java.io.FileInputStream.open0(Native Method)}}
{{ at java.io.FileInputStream.open(FileInputStream.java:195)}}
{{ at java.io.FileInputStream.<init>(FileInputStream.java:138)}}
{{ at org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50)}}
{{ at org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:142)}}
{{ at org.apache.flink.runtime.state.filesystem.FileStateHandle.openInputStream(FileStateHandle.java:68)}}
{{ at org.apache.flink.runtime.state.RetrievableStreamStateHandle.openInputStream(RetrievableStreamStateHandle.java:64)}}
{{ at org.apache.flink.runtime.state.RetrievableStreamStateHandle.retrieveState(RetrievableStreamStateHandle.java:57)}}
{{ at org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore.retrieveCompletedCheckpoint(ZooKeeperCompletedCheckpointStore.java:339)}}
{{ ... 32 more}}

 

The result was very high CPU utilization, high disk IO and a cascade of other application failures. In this situation, there should be a backoff on the retries to not bring down the whole node. Recovery may have been possible because the network outage lasted less than 30 seconds.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)