(DEPRECATED) Apache Flink Mailing List archive.

Recovery problem in Flink 1.6.3

Classic

List

Threaded

1 message

John Stone

Recovery problem in Flink 1.6.3

This is the first of two recovery problems I'm seeing running Flink 1.6.3 in Kubernetes. I'm posting them in separate messages for brevity and because the second is not directly related to the first. Any advice is appreciated.

Setup:
Flink 1.6.3 in Kubernetes (flink:1.6.3-hadoop28-scala_2.11). One JobManager and two TaskManagers (TM_1, TM_2). Each pod has 4 CPUs. Each TaskManager has 16 task slots. High availability is enabled. S3 (s3a) for storage. RocksDB with incremental snapshots. It doesn't matter if local recover is enabled - I've managed to replicate with both local recovery enabled and disabled.

Problem:
Flink cannot recover a job unless there are the same number of free task slots as the job's parallelism.

Replication steps:
Create a job with a parallelism of either 17 or 32 - enough to force the job to use both TMs. After the job has successfully is fully running and has taken a checkpoint, delete one of the TaskManagers (TM_1). Kubernetes will spawn a new TaskManager (TM_3) which will successfully connect to the JobManager.

Actual Behavior:
The running job will be canceled and redeployed but will be caught in a SCHEDULED state (shows as CREATED in the web UI). JobManager will repeatively attempt to request slots from the ResourceManager. The tasks in the job will never resume.

Expected Behavior:
Job should be fully unscheduled from TM_2. TM_2 and TM_3 should pick up the job. The job should successfully resume from the last checkpoint.

Known Workarounds:
1) Cancel and resubmit the job.
2) Using the above example, have a free TaskManager (TM_4) that also has 16 available slots.

Log snip:
2019-01-10 19:42:50,299 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Co-Process-Broadcast -> Flat Map -> Sink: Unnamed (29/32) (6078b9c76953c7c27b05b522880d3d1b) switched from CANCELING to CANCELED.
2019-01-10 19:42:50,299 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Try to restart or fail the job streamProcessorJob (c44a91b76ea99ead6fdf9b13a98c15bb) if no longer possible.
2019-01-10 19:42:50,299 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job streamProcessorJob (c44a91b76ea99ead6fdf9b13a98c15bb) switched from state FAILING to RESTARTING.
2019-01-10 19:42:50,299 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Restarting the job streamProcessorJob (c44a91b76ea99ead6fdf9b13a98c15bb).
2019-01-10 19:42:50,302 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job streamProcessorJob (c44a91b76ea99ead6fdf9b13a98c15bb) switched from state RESTARTING to CREATED.
2019-01-10 19:42:50,302 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Recovering checkpoints from ZooKeeper.
2019-01-10 19:42:50,308 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Found 1 checkpoints in ZooKeeper.
2019-01-10 19:42:50,308 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to fetch 1 checkpoints from storage.
2019-01-10 19:42:50,308 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to retrieve checkpoint 1.
2019-01-10 19:42:50,386 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Restoring job c44a91b76ea99ead6fdf9b13a98c15bb from latest valid checkpoint: Checkpoint 1 @ 1547149215694 for c44a91b76ea99ead6fdf9b13a98c15bb.
2019-01-10 19:42:50,388 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - No master state to restore
2019-01-10 19:42:50,388 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job streamProcessorJob (c44a91b76ea99ead6fdf9b13a98c15bb) switched from state CREATED to RUNNING.
2019-01-10 19:42:50,388 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Purchase Order Kafka Consumer (1/1) (49b728769a3a2b3a3a6ba45cd4445e3b) switched from CREATED to SCHEDULED.
2019-01-10 19:42:50,388 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Purchase Order Bundle Kafka Consumer (1/1) (1220cf4b9f5eb937191bb2232a482899) switched from CREATED to SCHEDULED.
2019-01-10 19:42:50,389 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: ControlInstruction Kafka Consumer -> Filter -> Filter -> Map (1/1) (29f69ee8fbc208cd7c63e99907d11386) switched from CREATED to SCHEDULED.
2019-01-10 19:42:50,389 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Ticket Update Kafka Consumer (1/1) (4bdfbcb7280fb7a7c9ea2d5aa02efa41) switched from CREATED to SCHEDULED.
2019-01-10 19:42:50,389 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Requesting new slot [SlotRequestId{83bcd1c29b885a7799bf6e5d73d1961c}] and profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} from resource manager.
2019-01-10 19:42:50,389 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Purchase Order Product Kafka Consumer (1/1) (9bfdbe9141c8b6715b890a35b026bb3b) switched from CREATED to SCHEDULED.

...snip...

2019-01-10 19:42:50,471 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Co-Process-Broadcast -> Flat Map -> Sink: Unnamed (32/32) (cd851b29b5533f8a3d4812ac7e8d47ab) switched from CREATED to SCHEDULED.
2019-01-10 19:42:52,733 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Registering TaskManager with ResourceID 6c84753ff746978c90a2e6d7627e6263 (akka.tcp://flink@flink-taskmanager-5748955d9-4wmj9:6126/user/taskmanager_0) at ResourceManager
2019-01-10 19:42:52,736 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Registering TaskManager with ResourceID 6c84753ff746978c90a2e6d7627e6263 (akka.tcp://flink@flink-taskmanager-5748955d9-4wmj9:6126/user/taskmanager_0) at ResourceManager
2019-01-10 19:42:55,572 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@flink-taskmanager-5748955d9-vgtl4:6126] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@flink-taskmanager-5748955d9-vgtl4:6126]] Caused by: [flink-taskmanager-5748955d9-vgtl4: Name or service not known]
2019-01-10 19:44:20,388 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint triggering task Source: ControlInstruction Kafka Consumer -> Filter -> Filter -> Map (1/1) of job c44a91b76ea99ead6fdf9b13a98c15bb is not in state RUNNING but SCHEDULED instead. Aborting checkpoint.
2019-01-10 19:45:50,388 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint triggering task Source: ControlInstruction Kafka Consumer -> Filter -> Filter -> Map (1/1) of job c44a91b76ea99ead6fdf9b13a98c15bb is not in state RUNNING but SCHEDULED instead. Aborting checkpoint.

Many thanks,

John Stone